Designing ETL Pipelines with Structured Streaming and Delta Lake

Bangalore Apache Spark Meetup

Details

The Bangalore Apache Spark Meetup group would like to invite you to a special talk by TD (Tathagata Das), hosted in collaboration with Databricks Inc.

Program

17:00: Registration

17:30 - 18:30: Designing ETL Pipelines with Structured Streaming and Delta Lake by Tathagata Das (Databricks Inc)

Structured Streaming in Apache Spark has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. However, the ability to express business logic easily solves only half of the challenge of building end-to-end streaming pipelines; the other half is efficient and reliable storage of the generated output. Delta Lake (https://delta.io/) is a new open-source storage format that simplifies data storage by bringing ACID transactions to Apache Spark and big data workloads. Developed by the original creators of Spark SQL and Structured Streaming, Delta Lake supports batch and streaming writes, schema validation and evolution, complex upserts, and scalable metadata handling, thus making it the most reliable and scalable way to store structured data. Together, these two systems make it very easy to build pipelines that provide end-to-end reliability and transactional guarantees.

However, to use them effectively, it is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine for both batch and stream processing, often provides multiple ways to solve the same problem. Understanding the requirements carefully therefore helps you architect a pipeline that meets your business needs in the most resource-efficient manner.

This talk will first introduce Delta Lake and its advantages. It will then discuss a number of common streaming design patterns for using Structured Streaming and Delta Lake effectively. It will teach you to ask the right set of questions when trying to architect your data pipelines.

Speaker: Tathagata Das is an Apache Spark committer and a member of the PMC. He was the lead developer behind Spark Streaming and currently develops Structured Streaming and Delta Lake. Earlier in life, he was a Research Assistant at Microsoft Research India and a graduate student at the University of California, Berkeley.

18:30 - 19:00: Networking

Date: Wednesday 18th December
Time: 17:00-19:00
Location: WeWork, Vaishnavi Signature, 77/1, Marathahalli-Sarjapur Outer Ring Rd, Bellandur, Outer Ring Road, Bangalore, Karnataka [masked]

Note: We are limiting RSVPs for this event. Please RSVP only if you are attending.