Spark SQL and beyond, with the Spark SQL lead developer from the U.S.

Details

We have a very exciting talk from the Spark SQL lead developer, Michael Armbrust, himself! See details below.

We would like to thank Zendesk and Hortonworks for sponsoring this talk. Zendesk is providing the venue and beer, and Hortonworks is providing the pizza.

Doors open: 6pm

Pizza and beer are provided

Talk start time: 6:30pm

Spark SQL is a new module in Apache Spark that integrates relational processing with Spark’s functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g., declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g., machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g., schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
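To give a feel for what "composable rules" means in an optimizer of this kind, here is a minimal sketch in plain Python. It is not Catalyst and does not use any Spark API; every name in it (Literal, Attribute, Add, fold_constants, drop_zero, transform_up, optimize) is hypothetical and chosen only for illustration. It assumes a tiny expression language with integer literals, named columns, and addition, and shows small independent rewrite rules applied bottom-up over an expression tree until nothing changes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


# A toy expression tree (hypothetical; not Spark's classes).
@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]
# A rule returns a rewritten node, or None when it does not apply.
Rule = Callable[[Expr], Optional[Expr]]


def fold_constants(e: Expr) -> Optional[Expr]:
    """Rewrite Add(Literal(a), Literal(b)) into Literal(a + b)."""
    if isinstance(e, Add) and isinstance(e.left, Literal) and isinstance(e.right, Literal):
        return Literal(e.left.value + e.right.value)
    return None


def drop_zero(e: Expr) -> Optional[Expr]:
    """Rewrite x + 0 and 0 + x into x."""
    if isinstance(e, Add):
        if e.left == Literal(0):
            return e.right
        if e.right == Literal(0):
            return e.left
    return None


def transform_up(e: Expr, rules: List[Rule]) -> Expr:
    """Rewrite the children first, then try each rule on the node itself."""
    if isinstance(e, Add):
        e = Add(transform_up(e.left, rules), transform_up(e.right, rules))
    for rule in rules:
        rewritten = rule(e)
        if rewritten is not None:
            e = rewritten
    return e


def optimize(e: Expr, rules: List[Rule], max_iterations: int = 100) -> Expr:
    """Apply the rule set repeatedly until the tree stops changing."""
    for _ in range(max_iterations):
        new = transform_up(e, rules)
        if new == e:
            return e
        e = new
    return e


if __name__ == "__main__":
    # (x + 0) + (1 + 2)
    tree = Add(Add(Attribute("x"), Literal(0)), Add(Literal(1), Literal(2)))
    print(optimize(tree, [fold_constants, drop_zero]))
```

Each rule only knows about one pattern, so adding a new optimization means adding one more small function to the list rather than changing the traversal.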

In this talk I'll describe the system and how we built an extensible optimizer and execution engine using features from the Scala programming language. I'll also talk about some early results of new initiatives going on at Databricks (Project Tungsten) that we hope will allow us to continue to leverage the developer productivity of the JVM while achieving bare-metal performance during execution.

Biography of Michael Armbrust

Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.

7:30pm: Q&A with Michael Armbrust