Bay Area Apache Spark Meetup @ HPE/Aruba Networks in Santa Clara


Details
Join us for an evening with the Bay Area Apache Spark Meetup, featuring tech talks about using Apache Spark at scale from Hewlett Packard Enterprise (HPE)/Aruba Networks (http://www.arubanetworks.com/) and Databricks (https://databricks.com/).
Thanks to HPE/Aruba Networks for hosting and sponsoring this meetup.
Agenda:
6:00 - 6:30 pm Mingling & Refreshments
6:30 - 6:40 pm Welcome, opening remarks, announcements, acknowledgments, and introductions
6:40 - 7:20 pm Aruba Networks/HPE Tech Talk 1
7:20 - 8:00 pm Databricks Tech Talk 2
8:00 - 8:30 pm Mingling
Tech-Talk 1: Techniques to Use PySpark for Data Correlation to Support Scalability and Complexity
Abstract: The correlation of multiple streaming data sources is a difficult problem, especially when data can arrive out-of-order or delayed, and when the correlation logic can be complicated.
In this talk, we describe a generic correlation framework built on PySpark and HDFS that handles these issues. Each data record passes through two stages of processing. In the first, stateless processing is performed, modifying the record and, in some cases, extracting information from it that is stored into a “correlation cache.” In the second step, the transformed records are enriched by information from the correlation cache, and further post-processing can take place. The correlation cache is stored in Parquet files on HDFS because of the high compression ratio and fast read times. According to pluggable logic, the correlation cache is aggregated over time to efficiently handle long-lived information. MessagePack-encoded text files on HDFS are used to make the correlation cache information relevant to a given batch available to the Spark workers. Based on dependencies between data sources and configurable cutoff delays, a best-effort attempt is made to run correlation only when the relevant data is available.
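To make the two-stage flow in the abstract concrete, here is a minimal sketch in plain Python. It is not the speaker's framework: an in-memory dict stands in for the Parquet-backed correlation cache on HDFS, no Spark is involved, and the record fields (`type`, `ip`, `user`) and the function names `stage_one` and `stage_two` are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Optional

Record = Dict[str, object]
Cache = Dict[str, str]  # hypothetical correlation cache: IP address -> user name


def stage_one(records: List[Record], cache: Cache) -> List[Record]:
    """Stateless pass: clean each record and, for some records,
    extract information into the correlation cache."""
    transformed = []
    for rec in records:
        rec = dict(rec)                       # copy so the input is not mutated
        rec["ip"] = str(rec["ip"]).strip()    # example of a stateless modification
        if rec.get("type") == "dhcp":         # assumption: DHCP records map IP to user
            cache[rec["ip"]] = str(rec["user"])
        transformed.append(rec)
    return transformed


def stage_two(records: List[Record], cache: Cache) -> List[Record]:
    """Enrichment pass: fill in missing fields from the correlation cache."""
    enriched = []
    for rec in records:
        rec = dict(rec)
        user: Optional[object] = rec.get("user") or cache.get(str(rec["ip"]))
        rec["user"] = user                    # None when no cache entry exists
        enriched.append(rec)
    return enriched


# The flow record arrives before the DHCP record that identifies its user.
batch = [
    {"type": "flow", "ip": " 10.0.0.5 ", "bytes": 1200},
    {"type": "dhcp", "ip": "10.0.0.5", "user": "alice"},
]
cache: Cache = {}
result = stage_two(stage_one(batch, cache), cache)
```

Because the first stage finishes for the whole batch before enrichment begins, the flow record still picks up its user even though it arrived ahead of the DHCP record. Records that are delayed across batches would need the cutoff delays and cache aggregation described in the abstract, which this sketch does not attempt.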
Bio: John Conley started out in theoretical physics before moving to network security. He has held data science and engineering roles at Cisco, Niara, and now HPE/Aruba. He is currently focused on designing and implementing distributed data pipelines for ETL and data correlation, and big data architecture to support queries and analytics at scale.
Tech-Talk 2: Challenging Web-Scale Graph Analytics with Apache Spark
Abstract: Graph analytics has a wide range of applications, from information propagation and network flow optimization to fraud and anomaly detection. The rise of social networks and the Internet of Things has given us complex web-scale graphs with billions of vertices and edges. However, in order to extract the hidden gems within those graphs, you need tools to analyze the graphs easily and efficiently.
At Spark Summit 2016, Databricks introduced GraphFrames, which implemented graph queries and pattern matching on top of Spark SQL to simplify graph analytics. In this talk, you’ll learn about work that has made graph algorithms in GraphFrames faster and more scalable. For example, new implementations like connected components have received algorithm improvements based on recent research, as well as performance improvements from Spark DataFrames. Discover lessons learned from scaling the implementation from millions to billions of nodes; compare its performance with other popular graph libraries; and hear about real-world applications.
Bio: Joseph Bradley is a Software Engineer and Apache Spark PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
