Bay Area Apache Spark Meetup @ Intel in Santa Clara

This is a past event

373 people went

Details

Join us for an evening of Bay Area Spark Meetup tech talks: Intel's Jiao Wang and Sergey Ermolin on using Apache Spark for deep learning applications, and Databricks' Tathagata Das on Structured Streaming in Apache Spark 2.1.

Thanks to Intel (http://www.intel.com) for sponsoring and hosting this meetup.

Agenda:

6:30 - 7:00 pm Mingling & Refreshments

7:00 - 7:10 pm Opening Remarks & Introductions

7:10 - 7:55 pm Intel Tech Talk 1 from Jiao Wang & Sergey Ermolin

7:55 - 8:00 pm Short Break

8:00 - 8:45 pm Databricks Tech Talk 2 from Tathagata Das (TD)

8:45 - 9:00 pm Mingling

Intel Tech Talk 1: Distributed Deep Learning at Scale on Apache Spark with BigDL

Abstract: Intel recently released BigDL, an open-source distributed deep learning framework for Apache Spark (https://github.com/intel-analytics/BigDL). It brings native support for deep learning functionality to Spark, provides orders-of-magnitude speedup over out-of-the-box open-source DL frameworks (e.g., Caffe/Torch/TensorFlow) in single-node Xeon performance, and efficiently scales out deep learning workloads on the Spark architecture. It also lets data scientists perform distributed deep learning analysis on big data using familiar tools such as Python and notebooks.

In this talk, we will introduce BigDL and show how big data users and data scientists can leverage it for deep learning analysis (such as image recognition, object detection, and NLP) on large amounts of data in a distributed fashion. This allows them to use their big data (e.g., Apache Hadoop and Spark) cluster as a unified data analytics platform for data storage, data processing and mining, feature engineering, traditional (non-deep) machine learning, and deep learning workloads.
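The core pattern behind this kind of distributed training can be sketched in a few lines of plain Python. This is an illustration of the synchronous data-parallel idea (per-partition gradients aggregated each step), not BigDL's actual API; the toy linear model, data shards, and learning rate are invented for illustration.

```python
# Illustrative sketch (plain Python, NOT BigDL's API) of synchronous
# data-parallel training: each partition computes a gradient on its shard,
# the gradients are aggregated (as a Spark reduce would do), and the
# shared parameters are updated identically everywhere.

def local_gradient(w, shard):
    """Gradient of mean squared error for a 1-D linear model y = w * x."""
    n = len(shard)
    return sum(2 * (w * x - y) * x for x, y in shard) / n

def train_step(w, partitions, lr=0.02):
    """One synchronous step: average per-partition gradients, then update w."""
    grads = [local_gradient(w, shard) for shard in partitions]  # parallel on Spark
    avg_grad = sum(grads) / len(grads)                          # aggregation step
    return w - lr * avg_grad

# Toy data following y = 3 * x, split across two "partitions" (data shards).
partitions = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]

w = 0.0
for _ in range(100):
    w = train_step(w, partitions)

print(round(w, 3))  # converges toward 3.0
```

In BigDL itself this loop runs inside Spark tasks over RDD partitions, but the control flow (local gradient, global aggregation, identical update) is the same.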

Bio: Jiao Wang is a software engineer in the Big Data Technology team at Intel. She works on developing and optimizing distributed deep learning frameworks on Apache Spark.

Sergey Ermolin is a Silicon Valley veteran with a passion for machine learning and artificial intelligence. His interest in neural networks goes back to 1996, when he used them to predict the aging behavior of quartz crystals and cesium atomic clocks made by Hewlett-Packard at its Santa Clara campus. Sergey is currently a member of the Big Data Technologies team at Intel, working on Apache Spark projects. Sergey holds an MSEE from Stanford and a BS in Physics and Mechanical Engineering from Cal State University, Sacramento.

Databricks Tech Talk 2: Easy, Scalable, Fault-Tolerant Stream Processing with Structured Streaming in Apache Spark

Abstract: Last year, in Apache Spark 2.0, we introduced Structured Streaming, a new stream processing engine built on Spark SQL that revolutionized how developers write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Datasets, and SQL. The Spark SQL engine then converts these batch-like transformations into an incremental execution plan that can process streaming data, automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.

Since Spark 2.0, we've been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can seamlessly extract insights from data, whether it comes from messy/unstructured files, a structured/columnar historical data warehouse, or arrives in real time from Kafka/Kinesis.

We'll walk through a concrete example in which, in less than 10 lines of code, we read from Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data, and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. We'll use techniques including event-time-based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
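The watermark mechanism mentioned above can be sketched in plain Python (again an illustration of the concept, not Spark's API): the watermark trails the maximum event time seen by a fixed delay, events arriving later than that are dropped, and per-window state can be finalized once the watermark passes the window. The window size, delay, and event stream are invented for illustration.

```python
# Hedged sketch (plain Python, NOT Spark's API) of event-time windowed
# counting with a watermark: watermark = max event time seen - delay;
# events whose timestamp falls below the watermark are dropped rather
# than kept in state indefinitely.

def window_start(event_time, size=10):
    """Assign an event time to the start of its tumbling window."""
    return (event_time // size) * size

def run(events, delay=5, size=10):
    counts = {}          # open windows: window start -> count
    max_event_time = 0
    dropped = []
    for key, t in events:
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - delay
        if t < watermark:            # too late: beyond the allowed delay
            dropped.append((key, t))
            continue
        w = window_start(t, size)
        counts[w] = counts.get(w, 0) + 1
    return counts, dropped

# Out-of-order events as (key, event_time) pairs.
events = [("a", 12), ("a", 14), ("a", 11), ("a", 25), ("a", 8)]
counts, dropped = run(events)
print(counts)   # {10: 3, 20: 1}
print(dropped)  # [('a', 8)] -- arrived after the watermark had passed 20
```

Note how the event at time 11 is accepted (it is late but within the delay), while the event at time 8 arrives after the watermark has advanced past it and is discarded; that bound is what lets the engine safely purge old state.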

Bio: Tathagata Das (TD) is an Apache Spark Committer and a member of the PMC. He's the lead developer behind Spark Streaming and is currently employed at Databricks. Before Databricks, he was at the AMPLab at UC Berkeley, researching datacenter frameworks and networks with professors Scott Shenker and Ion Stoica.

PARKING: Note that the meetup is in building SC-12. Use the address above in Google Maps. We have included the exact location in this map: