Let's kick off the New Year 2019 with our first BASM Meetup!
Join us for an evening at the Bay Area Apache Spark Meetup, featuring tech talks about Apache Spark at scale from Unravel Data and Databricks.
6:30 - 7:00 pm: Social Hour with Food, Drinks, Beer & Wine
7:00 - 7:05 pm: Introduction & Announcements from Jules
7:05 - 7:50 pm: Tech Talk from Unravel Data
8:00 - 8:45 pm: Tech Talk from Databricks
8:45 - 9:00 pm: Additional Networking, Q&A
Tech Talk 1: Putting AI to Work on Apache Spark
Presenter: Shivnath Babu
Abstract: Apache Spark simplifies AI, but why not use AI to simplify Spark performance and operations management? An AI-driven approach can drastically reduce the time Spark application developers and operations teams spend troubleshooting problems.
This talk will discuss algorithms that run real-time streaming pipelines and build ML models in batch, enabling Spark users to automatically solve problems such as: (i) fixing a failed Spark application, (ii) auto-tuning SLA-bound Spark streaming pipelines, (iii) identifying the best broadcast joins and caching for Spark SQL queries and tables, (iv) picking cost-effective machine types and container sizes to run Spark workloads on AWS, Azure, and Google Cloud, and more.
Bio: Shivnath Babu is CTO and co-founder of Unravel Data Systems and an adjunct professor of computer science at Duke University. Shivnath co-founded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.
Tech Talk 2: Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Training and Inference on Apache Spark
Presenter: Lu Wang
Abstract: Big data and AI are joined at the hip: the best AI applications require massive amounts of constantly updated training data to build state-of-the-art models. AI has always been one of the most exciting applications of big data. Project Hydrogen is a major Apache Spark initiative to bring the best AI and big data solutions together. It introduced barrier execution mode in the Spark 2.4.0 release to support distributed model training, and it explores optimized data exchange to accelerate distributed model inference.
In this talk, we will explain why barrier execution mode is needed, how it works, and how to use it to integrate distributed deep learning (DL) training on Spark. We will demonstrate HorovodRunner, the first Spark+AI integration powered by Project Hydrogen. It is built on the Horovod framework, developed by Uber, and on Databricks Runtime 5.0 for Machine Learning.
We will also share our experience and performance tips on combining Spark's Pandas UDFs with AI frameworks to scale complex model inference workloads.
Bio: Lu Wang is a software engineer at Databricks. His main research interests are developing high-performance parallel algorithms for scientific computing and applications. He has been actively involved in the development of Project Hydrogen, Spark Deep Learning Pipelines, and Spark MLlib since joining Databricks. Before Databricks, he worked at Lawrence Livermore National Laboratory on parallel multigrid linear solvers for exascale parallel machines, used to solve the linear systems arising from reservoir simulations. He received his Ph.D. from the Pennsylvania State University in 2014.