
PySpark: Real-time large-scale data processing with Python and Spark

Hosted By
Valentin H. and Philipp P.

Details

Spark is a lightning-fast engine for large-scale data processing and the leading candidate to succeed Map-Reduce. As a general-purpose in-memory cluster computing framework, it overcomes the high latency of batch-mode processing. Spark evaluates lazily, optimizes execution plans, and handles memory usage intelligently. It can run on Hadoop's resource manager and read any existing Hadoop data, and it provides rich APIs in Scala, Java and Python.

In the presentation the speaker will introduce the general concepts of Spark's infrastructure and the underlying computation model, including its basic unit of data, the resilient distributed dataset (RDD). Demos will show expressive Python examples of real-time data transformations and interactive analytics on large-scale data sets, as well as examples of iterative machine learning tasks.

Speaker: Philipp Pahl is a freelance data science consultant. He received a PhD in experimental particle physics. After leaving academia he contributed to several commercial projects, especially in big data modeling and predictive analytics.

PyData Berlin
Immobilienscout24
Andreasstraße 10, Berlin