Pre Spark Summit East: Behind the scenes and Exception handling

Hosted by Niels Zeilemaker
Details

Hi all, I'm happy to announce two speakers from Databricks who will talk about the history and future of Spark. First, Reynold Xin (a co-founder of Databricks) will shed some light on the history and evolution of data processing software. Afterwards, Herman van Hovell will present how Spark will deal with exceptions in ETL jobs in the future.

GoDataDriven will sponsor the drinks and pizza.

Agenda:

• 18:00 Arrive, mingle, pizza, drinks etc.

• 18:45 A behind the scenes look into Spark's API and engine evolutions by Reynold Xin.

Apache Spark is the most popular open source project in big data. While many users initially came to Spark for its performance, they stayed for the expressiveness of the APIs and ease-of-use of the engine.

In this talk, I will look back at the history of data processing software, from file systems, hierarchical databases, relational databases, and big data systems (e.g. MapReduce) to "small data" systems (e.g. R, Python). I will examine the pros and cons of these different systems, the abstractions they provide, and the engines underneath. I will then discuss the lessons we can learn from this evolution and how Spark is developed in this context, and take a peek into the future.

• 19:45 Exceptions are the norm: dealing with bad actors in ETL by Herman van Hovell.

Stable and robust data pipelines are a critical component of the data infrastructure of enterprises. Most commonly, data pipelines ingest messy data sources with incorrect, incomplete or inconsistent records and produce curated and/or summarized data for consumption by subsequent applications.

In this talk we go over new and upcoming features in Spark that enable it to better serve such workloads. These features include isolation of corrupt input records and files, useful diagnostic feedback to users, and improved support for handling nested types, which are common in ETL jobs.

• 21:30 Everybody out

Speakers:

Reynold Xin is a co-founder and Chief Architect at Databricks, a San Francisco-based cloud big data platform company, founded by the creators of Apache Spark. At Databricks, he led the development of Spark and pushed Spark to be the most popular open source big data project. Prior to Databricks, he was pursuing PhD research at the UC Berkeley AMPLab, where he worked on large-scale data processing.

Herman van Hovell is a Spark committer working on Spark SQL at Databricks. Before joining Databricks, he worked as a consultant for clients in banking, manufacturing and logistics. His interests include database systems, optimization and simulation. He is an avid diver and likes to cook in his free time. He is also the first Databricks engineer in Europe and at the new Databricks Amsterdam R&D center!

Data Council Amsterdam - NL Data Engineering & Science