Bay Area Apache Spark Meetup @ Workday in San Mateo


Details
Happy New Year!
Join us for an evening at the Bay Area Apache Spark Meetup, featuring tech-talks on Apache Spark (https://spark.apache.org/) at scale from Workday (https://www.workday.com/) and Databricks (https://databricks.com/).
Agenda:
6:30 - 7:00 pm Mingling & Refreshments
7:00 - 7:10 pm Welcome, opening remarks, announcements, acknowledgments, and introductions
7:10 - 7:50 pm Workday Prism Analytics: Unifying Interactive and Batch Data Processing Using Apache Spark
7:50 - 8:30 pm Upcoming Release of Apache Spark 2.3: What’s New?
8:30 - 8:45 pm More Mingling & Networking
Tech-Talk 1: Workday Prism Analytics: Unifying Interactive and Batch Data Processing Using Apache Spark
Abstract: Workday Prism Analytics enables data discovery and interactive Business Intelligence analysis for Workday customers. To prepare data for analysis, business users can clean up and transform their datasets in an interactive, modern data prep environment. Workday Prism Analytics therefore needs to run three types of scalable data processing applications: an “always on” query engine, an “always on” data prep application, and on-demand batch execution of transformation pipelines. We standardized on Apache Spark and Spark SQL for all three applications because of their scalability and the flexibility and extensibility of the Catalyst compiler. All applications share much of the compilation and execution code, except for sampling, caching, and result extraction.
In this talk, we will first introduce Workday Prism Analytics and describe its Spark-based interactive and batch data processing components. We will then describe the data prep transformations and their compilation into Spark DataFrames, through Spark SQL Catalyst plans, in both interactive and batch modes. We will focus on some of the challenges we encountered while compiling and executing complex pipelines and queries. For example, Spark SQL compilation times exceeded execution times for some low-latency queries, and compiled plans grew dangerously large for data prep pipelines with multiple self-joins and self-unions. We will describe the caching, sampling, and query compilation techniques that allow us to support an interactive user experience. Finally, we will conclude with an overview of the open challenges that we plan to tackle in the future.
Bio:
Dr. Andrey Balmin is a Sr. Principal Engineer at Workday, where he is building the self-service Prism Analytics platform. His work on the foundational technology for Prism began at Platfora (which was acquired by Workday). Prior to this, he was a Research Staff Member at IBM Almaden Research Center where he focused on search and query processing of semi-structured and graph-structured data in Data Warehousing and, later, Big Data platforms. He holds a Ph.D. degree in computer science from UC San Diego.
Tech-Talk 2: Upcoming Release of Apache Spark 2.3: What’s New?
Abstract:
Apache Spark 2.0 set the architectural foundations of structure in Spark, unified high-level APIs, Structured Streaming, and the underlying performant components such as the Catalyst Optimizer and the Tungsten engine. Since then, the Spark community has continued to build new features and fix numerous issues in the Spark 2.1 and 2.2 releases.
Continuing in that spirit, the upcoming release of Apache Spark 2.3 makes similar strides, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient features of the soon-to-be-released Spark 2.3:
• Kubernetes Scheduler Backend
• PySpark Performance and Enhancements
• Continuous Structured Streaming Processing
• DataSource v2 APIs
• Structured Streaming v2 APIs
Bio: Xiao Li is a software engineer at Databricks. His main interests are in Spark SQL, data replication, and data integration. Previously, he was an IBM Master Inventor and an expert on asynchronous database replication. He received his Ph.D. from the University of Florida in 2011. He is a Spark committer and PMC member.
Parking Instructions: The parking lot is under construction, but visitors may enter via Madison Ave.
