Pre Spark Summit EU Meetup: A deep dive into Spark 2.0 + unit-testing

Hosted by Niels Zeilemaker
Details

It's been a while since our last meetup, and a new major version of Spark has been released in the meantime.

That's why we're very happy to announce that Herman van Hovell is going to give us a deep dive into Catalyst. Additionally, Giovanni Lanzani is going to show us how to write unit tests for your PySpark job, and Niels Zeilemaker will do the same for Scala.

One final note: Databricks has shared a special 20% discount code, MeetupAms20, for Spark Meetup members for the upcoming Spark Summit in Brussels: https://spark-summit.org/eu-2016/

Agenda:

• 18:00 Arrive, mingle, etc.

• 18:45: A Deep Dive into the Catalyst Optimizer by Herman van Hovell

Catalyst is becoming one of the most important components in Apache Spark, as it underpins all the major new APIs in Spark 2.0, from DataFrames and Datasets to streaming. At its core, Catalyst is a general library for manipulating trees. Based on this library, we have built a modular compiler frontend for Spark, including a query analyzer, optimizer, and an execution planner. In this talk, I will introduce the core concepts of Catalyst by working through a few examples. I will also show how new and upcoming features are implemented using Catalyst. The audience will walk away with a deeper understanding of how Spark analyzes, optimizes and plans a user's query.
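To give a feel for what "manipulating trees" means, here is a minimal sketch in plain Python. It is not Catalyst code and not taken from the talk: the classes `Literal`, `Attribute` and `Add`, and the helpers `transform_up` and `fold_constants`, are hypothetical names made up for illustration. The idea shown is a rule that rewrites an expression tree bottom-up, here folding `1 + 2` into `3`.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical expression nodes; Catalyst's real classes are written in Scala.
@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]


def transform_up(expr, rule):
    """Rewrite the children first, then apply the rule to the rebuilt node."""
    if isinstance(expr, Add):
        expr = Add(transform_up(expr.left, rule), transform_up(expr.right, rule))
    return rule(expr)


def fold_constants(expr):
    """Rule: replace an addition of two literals by a single literal."""
    if (
        isinstance(expr, Add)
        and isinstance(expr.left, Literal)
        and isinstance(expr.right, Literal)
    ):
        return Literal(expr.left.value + expr.right.value)
    return expr  # leave every other node untouched


# x + (1 + 2) becomes x + 3
tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = transform_up(tree, fold_constants)
```

An optimizer in this style is a collection of such small rules applied repeatedly to the tree; the nodes are immutable, so each rule returns a new tree instead of modifying the old one.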

• 19:45: Unit-Testing Spark by Giovanni Lanzani

Data scientists are usually not good at testing code. This has recently been exacerbated by distributed frameworks like Spark, unpopular releases such as Python 3, and the abundance of SQL in the life of a data scientist.

In this talk we will see how to use the `py.test` package and a simple `setup.py` to rule them all. This will be followed by a short introduction to spark-testing-base, created by Holden Karau.

• 21:30: Everybody out

About Herman van Hovell:

Herman van Hovell is a Spark committer working on Spark SQL at Databricks. Before joining Databricks, he worked as a consultant for clients in banking, manufacturing and logistics. His interests include database systems, optimization and simulation. He is an avid diver and loves to cook.

About Giovanni Lanzani:

Giovanni Lanzani is Chief Science Officer at GoDataDriven. He got there once Rob made him an offer he couldn't refuse. Luckily a horse head was unnecessary.

A theoretical physicist by trade (he claims Leiden University made him a Doctor), he is now active in all things data science in a wide range of Dutch companies.

He was once offered the Chief Science position at the Nutella R&D department, but as he realized that the answer to all Nutella R&D projects was "Make more Nutella", he politely passed up the opportunity.

Data Council Amsterdam - NL Data Engineering & Science