Details

Join us for an online tech talk. Tech talks include slides, a demo + Q&A.

Abstract:
While it is common to use Delta Lake as a sink for change data captured from traditional data sources, customers are increasingly asking how to use Delta tables as a source for a change data capture (CDC) process. Put differently: how can we read a stream of changes from a Delta table so that they can be propagated downstream?

Some example use cases include (but are not limited to):

  • After cleaning the data following the Delta Architecture (bronze, silver, and gold tables), propagate this data to multiple downstream systems.

  • An e-commerce company uses a Delta table to store features for each of its customers, drawn from multiple upstream sources. Any change to customer data is propagated to downstream ML models so that they can provide the latest product recommendations to the customer.

  • A large software company uses a Delta table to process and store hundreds of terabytes of customer telemetry data. Changes in this table need to be sent to a downstream consumer to update a range of dashboards and analytics.

In each of these cases, we want to capture a change stream from a Delta table and send it somewhere for further processing. In this session, we will discuss the architecture, use cases, and solutions.

Agenda:
9:00AM - 9:50AM - Presentation + Demo
9:50AM - 10:00AM - Q&A

Link to join: https://databricks.zoom.us/j/524995713

Speakers:
Paul Roome is a Senior Solution Architect at Databricks, where he focuses on helping large customers in the Bay Area achieve their data ambitions. Prior to Databricks, Paul worked on applications of large scale entity resolution and graph analytics, as well as using machine learning to fight fraud and organized crime.

Denny Lee is a Staff Developer Advocate at Databricks. He is a hands-on distributed systems and data sciences engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premise and cloud environments. His current technical focuses include Distributed Systems, Apache Spark, Deep Learning, Machine Learning, and Genomics.