Past Meetup

Bay Area Apache Spark Meetup @ Adobe in San Jose, CA

241 people went


Join us for an evening at the Bay Area Apache Spark Meetup, featuring tech talks about Apache Spark from Adobe and from an Apache Spark committer at Databricks.


6:00 - 6:30 pm Mingling & Refreshments
6:30 - 6:40 pm Welcome, opening remarks, announcements, acknowledgments, and introductions (Jules Damji + Adobe)
6:40 - 7:20 pm Apache Spark at Adobe
7:20 - 8:00 pm Upcoming Apache Spark 2.4: What’s New & Why Should You Care
8:00 - 8:30 pm More Mingling & Networking

Tech-Talk 1: Apache Spark at Adobe


The Adobe Cloud Platform is a multi-tenant, big data stack as a service on the cloud which provides the modern foundation for all the various parts of the Adobe Experience Cloud.

From a data processing perspective, Adobe uses Apache Spark in a variety of scenarios. We will talk about the high-level data architecture, briefly touching on the infrastructure/scale/challenges, and lastly, we will cover how we are leveraging Spark.

As part of the Cloud Platform, we have also built a Query Engine on top of Spark SQL for ad-hoc data querying. The Query Engine implements a PostgreSQL protocol and uses Akka Streams and the Presto Parser as an abstraction layer around Spark SQL. We will cover its high-level architecture and the patches we made to Spark SQL, such as support for nested column pruning, which are critical to our performance when accessing data with thousands of nested columns.
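To make the idea of nested column pruning concrete, here is a minimal sketch in plain Python. It is not Adobe's patch or Spark's implementation; the function `prune_schema`, the dict-based schema representation, and the field names are all hypothetical and exist only for illustration. The point it shows is that a query touching a few nested fields only needs a small slice of a very wide schema.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical schema model: field name -> type name (str) or nested struct (dict).
Schema = Dict[str, Any]


def prune_schema(schema: Schema, paths: Iterable[str]) -> Schema:
    """Keep only the fields reachable through the requested dotted paths."""
    # Group the requested paths by their first component.
    wanted: Dict[str, List[str]] = {}
    for path in paths:
        head, _, rest = path.partition(".")
        wanted.setdefault(head, []).append(rest)

    pruned: Schema = {}
    for name, rests in wanted.items():
        if name not in schema:
            raise KeyError(f"unknown field: {name}")
        field = schema[name]
        if not isinstance(field, dict):
            # Leaf column: it cannot have sub-fields.
            if any(rests):
                raise KeyError(f"{name} is not a struct")
            pruned[name] = field
        elif all(rests):
            # Only sub-fields were requested, so recurse into the struct.
            pruned[name] = prune_schema(field, rests)
        else:
            # The whole struct was requested at least once: keep it intact.
            pruned[name] = field
    return pruned


# Illustrative schema; real datasets may have thousands of nested columns.
schema = {
    "id": "string",
    "profile": {
        "name": "string",
        "address": {"city": "string", "zip": "string"},
        "tags": "array<string>",
    },
    "events": {"count": "long", "last": "timestamp"},
}

needed = prune_schema(schema, ["id", "profile.address.city", "profile.name"])
print(needed)
```

With a columnar format, a reader that is handed the pruned schema can skip every column outside it, which is where the performance benefit would come from.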

Yogesh Natarajan is a senior software engineer in the Data Platform group at Adobe. His interests include building server-side web applications and big data technologies. He previously worked at Chegg and Yahoo, and holds a master's degree from UC Irvine.

Andrew is a senior software engineer in the Data Platform group at Adobe. He specializes in building modern, scalable, cloud-based Scala applications.

Tech-Talk 2: Upcoming Apache Spark 2.4: What’s New & Why Should You Care

The upcoming Apache Spark 2.4 release is the fifth release in the 2.x series. This talk will provide an overview of the major features and enhancements in this upcoming release.

* A new scheduling model (Barrier Scheduling) to enable users to properly embed distributed Deep Learning training as a Spark stage to simplify the distributed training workflow.
* 35 higher-order functions are added for manipulating arrays and maps in Spark SQL.
* A new native Avro data source, based on Databricks' spark-avro module.
* PySpark introduces an eager evaluation mode on all operations for teaching and debuggability.
* Spark on Kubernetes (K8s) adds PySpark and R support, as well as client-mode support.
* Various enhancements in Structured Streaming, e.g., stateful operators in continuous processing.
* Various performance improvements in built-in data sources, e.g., Parquet nested schema pruning.
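As a rough illustration of the higher-order functions item above, the sketch below models four of the Spark SQL functions (`transform`, `filter`, `exists`, `aggregate`) in plain Python, with the corresponding Spark SQL expression in a comment next to each call. The Python helpers are stand-ins written for this example, not Spark APIs, and they do not model Spark's NULL handling.

```python
from functools import reduce
from typing import Callable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")
A = TypeVar("A")


def transform(xs: List[T], f: Callable[[T], U]) -> List[U]:
    """Apply f to every element of the array."""
    return [f(x) for x in xs]


def array_filter(xs: List[T], pred: Callable[[T], bool]) -> List[T]:
    """Keep only the elements for which pred holds."""
    return [x for x in xs if pred(x)]


def exists(xs: List[T], pred: Callable[[T], bool]) -> bool:
    """True if pred holds for at least one element."""
    return any(pred(x) for x in xs)


def aggregate(xs: List[T], zero: A, merge: Callable[[A, T], A]) -> A:
    """Fold the array into a single value, starting from zero."""
    return reduce(merge, xs, zero)


values = [1, 2, 3, 4]

# SQL: transform(values, x -> x + 1)
plus_one = transform(values, lambda x: x + 1)

# SQL: filter(values, x -> x % 2 = 0)
evens = array_filter(values, lambda x: x % 2 == 0)

# SQL: exists(values, x -> x > 3)
has_large = exists(values, lambda x: x > 3)

# SQL: aggregate(values, 0, (acc, x) -> acc + x)
total = aggregate(values, 0, lambda acc, x: acc + x)

print(plus_one, evens, has_large, total)
```

In Spark SQL these functions take a lambda over array elements, so the array can be processed in place within a row instead of being exploded and regrouped.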

Xiao Li is a software engineer at Databricks. His main interests are in Spark SQL, data replication, and data integration. Previously, he was an IBM master inventor and an expert on asynchronous database replication. He received his Ph.D. from the University of Florida in 2011. He is a Spark committer and PMC member.

PARKING: All visitors attending the Adobe/Spark Meetup in the ET 01 Park Conference room will need to park in the East Tower basement, level 1.