Upcoming Apache Spark and Data Lineage

Hosted by Niels Zeilemaker

Details

Hi all, I'm happy to announce another Spark Meetup.
This time Tim Hunter from Databricks will talk about upcoming features in Spark 2.4, and Serge Smertin from Adyen will give a talk on data lineage.
The meetup takes place at, and is sponsored by, GoDataFest, which runs from October 15 to 19, 2018. The week is dedicated to data technology and features free talks, training sessions and workshops. Leading tech companies, including AWS (Monday, October 15), Dataiku (Tuesday, October 16), Databricks (Wednesday, October 17), and Google Cloud (Thursday, October 18), each host an entire day to share their latest innovations. The final day, Friday, October 19, is dedicated to open source. Feel free to mix and match activities to create your own ultimate data festival. Make sure to register directly, as seats are limited. www.godatafest.com
Agenda:
18:00: Arrive, mingle, food (pizza), drinks etc.
18:45: New features in Upcoming Apache Spark 2.4 and MLflow: An open platform to simplify the machine learning lifecycle by Tim Hunter
This talk will combine two topics: I will start with an overview of the latest developments in Spark, and I will then present a recent Databricks project for simplifying machine learning. The soon-to-be-released Apache Spark 2.4 comes packed with new functionality: a new scheduling model, a native Avro data source, PySpark's eager evaluation mode, Kubernetes support, and many other improvements. I will then introduce MLflow, a new open source project from Databricks that simplifies the machine learning lifecycle. MLflow provides APIs for tracking experiment runs between multiple users within a reproducible environment, and for managing the deployment of models to production. Moreover, MLflow is designed to be an open, modular platform, in the sense that you can use it with any existing ML library and incorporate it incrementally into an existing ML development process.
19:45: Data lineage in context of interactive analysis by Serge Smertin
In this presentation we'll focus on tracking data lineage for interactive data exploration through the notebooks we use within our organization. We will show a set of techniques for auditing the data journey from the code entered in a notebook, down through the levels of execution planning, DataFrames, RDDs and Hadoop's file formats, and back to the visualizations displayed to the data analyst. We've created a custom tool using the Java Instrumentation API that allowed us to add extra security to certain parts of the Spark driver's JVM runtime environment and to audit critical parts of the job execution flow by instrumenting workers in a distributed environment. This allowed us to harden the security of the Python and R runtimes integrated with Jupyter notebooks through customized PySpark and SparkR kernels. This presentation will be of interest to anyone with a background in Java, Scala and Python.
21:30: End of the meetup/everybody out
Hope to see you there, Niels

Data Council Amsterdam - NL Data Engineering & Science