Open Source Reliability for Data Lake with Apache Spark

Hosted By
Matthew H.
Details

Presenter: Michael Armbrust

Abstract: Delta Lake is an open source storage layer that brings reliability to data lakes. It offers ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

In this talk, we will cover:
What data quality problems Delta Lake helps address
How to convert your existing application to Delta Lake
How the Delta Lake transaction protocol works internally
The Delta Lake roadmap for the next few releases
How to get involved!
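
For a rough feel of the transaction protocol topic before the talk, here is a toy sketch in plain Python. It is not Delta Lake's implementation. It only assumes the general idea of an ordered log of JSON commit files, where a writer claims the next version number by creating a file that must not already exist, and readers replay the log to find the live data files. The helper names `commit` and `snapshot` and the file names are hypothetical, and a real system would also need the file contents to appear atomically, which this sketch does not attempt.

```python
import json
import os
import tempfile


def commit(log_dir, version, actions):
    """Try to write commit `version`; return False if it is already taken."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Mode "x" fails if the file exists, giving put-if-absent semantics.
        with open(path, "x") as f:
            for action in actions:
                f.write(json.dumps(action) + "\n")
        return True
    except FileExistsError:
        return False


def snapshot(log_dir):
    """Replay commits in version order to get the set of live data files."""
    files = set()
    # Zero-padded names sort lexicographically in version order.
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files


log_dir = tempfile.mkdtemp()
commit(log_dir, 0, [{"add": {"path": "part-0.parquet"}}])
commit(log_dir, 1, [{"add": {"path": "part-1.parquet"}}])
# A compaction swaps two files for one within a single commit.
commit(log_dir, 2, [
    {"remove": {"path": "part-0.parquet"}},
    {"remove": {"path": "part-1.parquet"}},
    {"add": {"path": "part-2.parquet"}},
])
# A second writer that also tries to claim version 2 loses and must retry.
conflict = commit(log_dir, 2, [{"add": {"path": "part-x.parquet"}}])
live = snapshot(log_dir)
```

In this sketch the losing writer gets `False` back and would re-read the log before trying the next version, which is the optimistic-concurrency pattern in miniature.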

Bio: Michael Armbrust is a committer and PMC member of Apache Spark and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and the Delta Lake open source project. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.

Spark-NYC
120 Park Ave · NYC, NY