PySpark: Real-time large-scale data processing with Python and Spark

Andreasstraße 10, Berlin

How to find us

The talk takes place in the "ScoutCasino", the company canteen on the ground floor at the back of the building.

Spark is a lightning-fast engine for large-scale data processing and the leading candidate to succeed MapReduce. As a general-purpose, in-memory cluster computing framework, it avoids the high latency of batch-mode processing. Spark evaluates execution plans lazily, optimizes them before running, and manages memory usage carefully. It can run on Hadoop's resource manager and read any existing Hadoop data. It also provides rich APIs in Scala, Java and Python.
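
To make the idea of lazy evaluation concrete, here is a minimal sketch in plain Python. It is a toy model under stated assumptions, not Spark's implementation: the class name LazyDataset and the helper square are hypothetical and chosen only for illustration. Transformations are merely recorded, and nothing is computed until a result is requested.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source        # any iterable of input records
        self._steps = tuple(steps)   # recorded transformations, not yet executed

    def map(self, fn):
        # Record the transformation and return a new dataset; no work happens here.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Same idea: only the plan grows, the data is left untouched.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # The "action": replay the recorded plan over the source and materialize it.
        records = iter(self._source)
        for kind, fn in self._steps:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return list(records)


calls = []  # tracks when square() actually runs


def square(x):
    calls.append(x)
    return x * x


even_squares = LazyDataset(range(10)).map(square).filter(lambda v: v % 2 == 0)
pending_calls = len(calls)       # 0: building the plan did not call square()
result = even_squares.collect()  # [0, 4, 16, 36, 64]
```

In PySpark the same split exists on RDDs: map and filter are transformations that only extend the plan, while an action such as collect triggers the actual computation on the cluster.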

In the presentation, the speaker will introduce the general concepts of Spark's infrastructure and its underlying computation model, including the basic unit of data, the resilient distributed dataset (RDD). Demos will show expressive Python examples of real-time data transformations and interactive analytics on large-scale data sets, as well as examples of iterative machine learning tasks.

Speaker: Philipp Pahl is a freelance data science consultant. He received a PhD in experimental particle physics. After leaving academia, he contributed to several commercial projects, especially in big data modeling and predictive analytics.