Apache Spark 3.0 Deep Dives at Data + AI Summit Europe (Data + AI Online Meetup)

Hosted by Guido Oswald

Details

Jules and Denny will recap keynote highlights, and present personal session picks. Jacek will speak about Spark 3.0 internals, and Scott will discuss structured streaming microservice architectures.

REGISTER NOW (for FREE) on the Data + AI Summit EU site: https://databricks.swoogo.com/dataaisummit-europe-2020

As you register for Summit, there is a step to register for special events. Click the meetups you'd like to attend. Please note that this meetup is only accessible through the Data + AI Summit platform.

Talk 1: Arbitrary Stateful Aggregation in Spark Structured Streaming and Delta Lake

Speaker: Jacek Laskowski
Abstract: While pursuing my understanding of Apache Spark 3.0 and Delta Lake 0.7.0, I noticed a few themes emerge. Customers often ask me to help them with advanced concepts in Spark Structured Streaming and Delta Lake, such as Arbitrary Stateful Aggregation and MERGE INTO, respectively. My current understanding is that there is no way to use MERGE INTO in a streaming query without foreachBatch, and that combining it with the flatMapGroupsWithState operator makes for a very advanced streaming system. I'll share a few discoveries and the open questions I'm hoping to find answers to during this meetup. I will also show how to use these tools and what problems I've been running into.

Talk 2: Building a Streaming Microservice Architecture: with Apache Spark Structured Streaming and Friends

Speaker: Scott Haines
Abstract: As we continue to push the boundaries, new methodologies and techniques continue to emerge to handle larger and larger workloads – from real-time processing and aggregation of user / behavioral data, to rule-based / conditional distribution of event and metric streams, to almost any data pipeline / lineage problem. These workloads are typical in most modern data platforms and are critical to all operational analytics systems, data storage systems, ML / DL and beyond.

One of the common problems I’ve seen across a lot of companies can be reduced to a general data reliability problem. What started as a few systems can quickly fan out into a slew of independent components and serving layers, all of which need to be scaled up, down or out with zero downtime to meet the demands of a world hungry for data.

During this technical deep dive, a new mental model will be built up which aims to reinvent how we architect massive, interconnected services using Kafka, Google Protocol Buffers / gRPC, and Parquet / Delta Lake / Spark Structured Streaming. The material presented is based on lessons learned the hard way while building up a massive real-time insights platform at Twilio, where data integrity and stream fault tolerance are as critical as the services our company provides.

Speaker Bios:

Jacek Laskowski is an IT freelancer specializing in Apache Spark, Delta Lake, Apache Kafka and Kafka Streams. Jacek offers software development and consultancy services with very hands-on, in-depth workshops and mentoring. He is best known for "The Internals Of" online books, available free at https://books.japila.pl/.

Scott Haines is a full-stack engineer with a current focus on real-time analytics and intelligence systems. He works at Twilio as a Senior Principal Software Engineer on the Voice Insights team, where he helped drive Spark adoption and streaming pipeline architectures, and helped to architect and build out a massive stream and batch processing platform. Twitter: https://twitter.com/newfront

Jules Damji is a Developer Advocate at Databricks and an MLflow contributor with 15+ years of experience. He has worked at leading companies, such as Sun Microsystems, Netscape, and Hortonworks, building large-scale distributed systems.

Denny Lee is a Staff Developer Advocate at Databricks and a hands-on distributed systems and data sciences engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments.

Cloud Scale Data Science virtual UserGroup (worldwide)