GraphFrames, Survival Analysis, and SnappyData + Spark


We have a packed Seattle Spark Meetup at the Expedia Building this week, with three great sessions!
Agenda:
• 18:00-18:20: Networking
• 18:20-18:30: Introductions
• 18:30-20:00: Three sessions (20-30min each)
• 20:00-20:30: Networking
The sessions are:
On-Time Flight Performance with GraphFrames for Apache Spark
Graph structures are a more intuitive approach to many classes of data problems. Whether traversing social networks, restaurant recommendations, or flight paths, it is easier to understand these data problems within the context of graph structures: vertices, edges, and properties. For example, the analysis of flight data is a classic graph problem because airports are represented by vertices and flights are represented by edges. In addition, these flights carry numerous properties, including but not limited to departure delays, plane type, and carrier.
In this session, we will use GraphFrames (as recently announced in "Introducing GraphFrames") within Databricks notebooks to quickly and easily analyze flight performance data organized in graph structures. Because we’re using graph structures, we can easily ask a number of questions that are not as intuitive to express with tabular structures, such as finding structural motifs, ranking airports using PageRank, and finding shortest paths between cities. GraphFrames leverages the distribution and expression capabilities of the DataFrame API to both simplify your queries and take advantage of the performance optimizations of the Spark SQL engine. In addition, with GraphFrames, graph analysis is available in Python, Scala, and Java.
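To make the vertices-and-edges framing concrete, here is a minimal sketch in plain Python (standard library only, deliberately not the GraphFrames API). The flight records, the function names `build_graph`, `fewest_hops`, and `pagerank`, and the damping factor are all assumptions made for illustration; the session itself works on real flight data with GraphFrames.

```python
from collections import deque

# Hypothetical sample data: (origin, destination, departure delay in minutes).
flights = [
    ("SEA", "SFO", 12),
    ("SEA", "DEN", -3),
    ("SFO", "JFK", 25),
    ("DEN", "JFK", 5),
    ("JFK", "SEA", 40),
    ("SFO", "SEA", 0),
]


def build_graph(edges):
    """Build an adjacency list: airport (vertex) -> destinations (edges)."""
    graph = {}
    for src, dst, _delay in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])  # keep airports that have no departures
    return graph


def fewest_hops(graph, start, goal):
    """Breadth-first search for a path with the fewest flights."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal is not reachable from start


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; ranks always sum to 1."""
    n = len(graph)
    ranks = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        new_ranks = {v: (1.0 - damping) / n for v in graph}
        for v, outs in graph.items():
            # Airports without departures spread their rank over all vertices.
            targets = outs if outs else list(graph)
            share = damping * ranks[v] / len(targets)
            for dst in targets:
                new_ranks[dst] += share
        ranks = new_ranks
    return ranks


graph = build_graph(flights)
path = fewest_hops(graph, "SEA", "JFK")
ranks = pagerank(graph)
top_airport = max(ranks, key=ranks.get)
print(path, top_airport)
```

The same questions (shortest paths, airport ranking) are what a graph library answers at scale; the sketch only shows why they are natural to ask once the data is modeled as a graph.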
Speaker: Denny Lee, Databricks
Using Survival Analysis to Optimize TTL values for Expedia Hotel Search
Yanning Liu's project at Expedia analyzes the pricing and availability history from Expedia Hotel shopping data and uses the outcome to perform survival analysis, with the aim of optimizing TTL values for the hotel search cache. The goal is to improve the search cache hit rate while keeping cache accuracy within accepted limits. The analysis is performed on Spark clusters that are provisioned on AWS and consume the log data uploaded to AWS S3.
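As a rough illustration of how survival analysis can inform a TTL, the sketch below uses a Kaplan-Meier estimator in plain Python. This is an assumption about one possible approach, not a description of the method used in the talk: the observations, the function names `kaplan_meier` and `choose_ttl`, and the 75% accuracy target are all hypothetical.

```python
from collections import Counter

# Hypothetical observations for cached hotel prices:
# (minutes until the observation ended, True if the price changed,
#  False if the entry was still valid when observation stopped, i.e. censored).
observations = [
    (5, True), (10, True), (10, False), (15, True), (20, True),
    (20, True), (30, False), (30, True), (45, True), (60, False),
]


def kaplan_meier(obs):
    """Return [(time, survival probability)] at each time a change occurred."""
    changes = Counter(t for t, changed in obs if changed)
    leaving = Counter(t for t, _ in obs)  # entries leaving the risk set at t
    at_risk = len(obs)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        if changes[t]:
            # Fraction of still-valid entries that survive past time t.
            survival *= 1.0 - changes[t] / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


def choose_ttl(curve, min_accuracy):
    """Largest observed change time whose survival estimate meets the target."""
    ttl = 0
    for t, survival in curve:
        if survival < min_accuracy:
            break
        ttl = t
    return ttl


curve = kaplan_meier(observations)
ttl = choose_ttl(curve, min_accuracy=0.75)
print(curve)
print("suggested TTL (minutes):", ttl)
```

The trade-off matches the one described above: a longer TTL raises the cache hit rate, while the survival curve estimates how likely a cached entry is to still be accurate at that age.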
Speaker: Yanning Liu, Expedia
SnappyData + Spark = Real Time Analytics, Machine Learning, Streaming, OLTP
Apache Spark has come a long way since it began as a faster, batch-oriented MapReduce alternative for Hadoop. With support for streaming, machine learning, and most recently SQL, Spark aspires to become a player in the realm of real-time analytics at scale. However, much of its underpinnings remain batch-oriented and unsuited for highly concurrent OLAP workloads.
In this talk, we will describe:
• How SnappyData is innovating and renovating Spark's core underpinnings to make it suitable for real-time operational analytics
• How we helped create a unified platform that supports real-time analytics, transactions, streaming, and machine learning in a single consistent data store
• Approximate query processing, which is the first real attempt at reducing the time to insights when working with big data
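The general idea behind approximate query processing can be sketched with uniform random sampling: answer an aggregate from a small sample and report an error bound instead of scanning everything. The example below is plain Python and is not SnappyData's implementation; the synthetic data, the function name `approximate_mean`, and the 1% sample fraction are assumptions for illustration.

```python
import math
import random
import statistics


def approximate_mean(values, sample_fraction, seed=0):
    """Estimate the mean from a uniform random sample.

    Returns (estimate, margin), where margin is an approximate 95%
    confidence half-width based on the normal approximation.
    """
    rng = random.Random(seed)
    k = max(2, int(len(values) * sample_fraction))
    sample = rng.sample(values, k)
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(k)
    return estimate, 1.96 * std_error


# Synthetic "big" column: 100,000 hypothetical values between 0 and 100.
data_rng = random.Random(42)
population = [data_rng.uniform(0, 100) for _ in range(100_000)]

estimate, margin = approximate_mean(population, sample_fraction=0.01)
exact = statistics.fmean(population)
print(f"approximate: {estimate:.2f} +/- {margin:.2f}, exact: {exact:.2f}")
```

Here the estimate touches only 1% of the rows, which is where the reduction in time to insight comes from; the price is an answer with a stated margin of error rather than an exact value.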
Speaker: Suds Menon, SnappyData. Suds is responsible for engineering, marketing, and venture funding, and juggles all operational aspects of SnappyData.
