
PySpark: Real-time large-scale data processing with Python and Spark

Hosted By
Valentin H. and Philipp P.

Details

Spark is a lightning-fast engine for large-scale data processing and the leading candidate to succeed Map-Reduce. As a general-purpose in-memory cluster computing framework, it overcomes the high latency of batch-mode processing. Spark evaluates lazily, optimizes execution plans, and handles memory usage intelligently. It can run on Hadoop's resource manager and read any existing Hadoop data, and it provides rich APIs in Scala, Java and Python.

In the presentation the speaker will introduce the general concepts of Spark's infrastructure and the underlying computation model, including its basic unit of data, the resilient distributed dataset (RDD). Demos will show expressive Python examples of real-time data transformations and interactive analytics on large-scale data sets, as well as examples of iterative machine learning tasks.

Speaker: Philipp Pahl is a freelance data science consultant. He received a PhD in experimental particle physics. After leaving academia he contributed to several commercial projects, especially in big data modeling and predictive analytics.

PyData Berlin
Immobilienscout24
Andreasstraße 10, Berlin