
Details

Join us for three tech talks on Data Engineering including Spark SQL and Dagster!

Talk One

Title: Faster Spark SQL: Adaptive Query Execution in Databricks by Allison & Maryann
Abstract: Over the years, there has been an extensive & continuous effort to improve Spark SQL's query optimizer & planner in order to generate high-quality query execution plans. One of the biggest improvements is the cost-based optimization framework, which collects & leverages a variety of data statistics (e.g., row count, number of distinct values, NULL values, max/min values) to help Spark pick the best query plan.

Adaptive Query Execution, new in Spark 3.0, goes a step further by re-optimizing & adjusting query plans based on runtime statistics collected during query execution. This talk will introduce the adaptive query execution framework along with a few of the optimizations it employs to address some major performance challenges the industry faces when using Spark SQL. We will illustrate through query examples how these statistics-guided optimizations accelerate execution. Finally, we will share the significant performance improvements we have seen on the TPC-DS benchmark with Adaptive Query Execution.
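
As a rough illustration of the idea (a toy sketch in plain Python, not Spark's implementation), the snippet below re-decides a join strategy once a real input size is known and merges small adjacent shuffle partitions. The function names, the threshold value, and the sample sizes are all assumptions made up for this example.

```python
# Toy illustration of runtime re-optimization; not Spark's actual code.
# The threshold, names, and sizes below are hypothetical.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join(build_side_bytes):
    """Pick a join strategy from the size of the smaller join input."""
    if build_side_bytes <= BROADCAST_THRESHOLD_BYTES:
        return "broadcast_hash_join"
    return "sort_merge_join"


def coalesce_partitions(sizes, target_bytes):
    """Merge adjacent partitions while each merged group stays within target_bytes.

    A single partition larger than the target is kept as its own group.
    """
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Planning time: a stale estimate says the filtered table is large.
planned = choose_join(500 * 1024 * 1024)

# Runtime: the finished stage reports its real (much smaller) output size.
replanned = choose_join(4 * 1024 * 1024)

print(planned, "->", replanned)
print(coalesce_partitions([10, 20, 5, 70, 30, 40], target_bytes=64))
```

In Spark itself this behavior is controlled by configuration such as spark.sql.adaptive.enabled rather than by user code like the above.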

Talk Two

Title: Apache Spark SQL optimizations for machine learning across internet-sized data by Michael & Wenzhe (https://github.com/mrtong96/spark_2021_talk)
Abstract: Quantcast regularly deals with internet-sized data (hundreds of billions of events per day) in order to train models that optimize online advertising. For the past two years, Quantcast has been investing in Spark as the backbone of our new and experimental data processing pipelines. From this work we have learned several Spark SQL optimizations that can make our jobs orders of magnitude faster than the naive approach. We will describe how we use these optimizations in our pipelines, with examples on sanitized data, including:

  1. Data transformations to minimize query costs.
  2. Leveraging natural features in the data set to efficiently group and process it with pandas UDFs.
  3. Employing low-level optimizations in Python, using vectorization and JIT compilation for faster execution.

Talk Three

Title: Introduction, principles and origin of Dagster by Nick
Abstract: Nick will cover the principles and origin of Dagster. Dagster is a new type of workflow engine: a data orchestrator. Moving beyond just managing the ordering & physical execution of data computations, Dagster considers the entire data application lifecycle. Practitioners using Dagster build data-aware dependency graphs designed for local development and testing; deploy those graphs to a multi-tenant, cloud-native orchestration engine; and then monitor and observe the data assets produced by those computations.

In this talk, Nick will cover how Dagster differentiates itself across the three stages (dev & test, deploy & execute, monitor & observe) of the application lifecycle. Through a demo and code snippets, the talk aims to show how the Dagit web UI and the Dagster programming model can serve a variety of data practitioners.

REGISTER NOW (for FREE) on the DAIS21 site: https://databricks.com/dataaisummit/north-america-2021

Speakers

** Nick Schrock is the founder and CEO of Elementl, the company behind Dagster. Previously, Nick worked at Facebook, where he co-created GraphQL.

** Michael Tong is a Machine Learning Engineer at Quantcast. His current projects at Quantcast focus on developing model training pipelines to process petabytes of data to train tens of thousands of models.

** Wenzhe (David) Xu is a Machine Learning Engineer at Quantcast. He has been applying various machine learning techniques to large-scale graphs by utilizing Spark SQL.

** Maryann Xue is a staff software engineer at Databricks and a committer & PMC member of Apache Calcite & Apache Phoenix.

** Allison Wang is a software engineer at Databricks, primarily focusing on Spark SQL.
