Python & Spark by Thorsten Greiner

Hosted By
Thorsten G. and 2 others

Details

7:00pm Socializing

7:15pm Python & Spark

Python offers libraries such as NumPy, Pandas and scikit-learn, which make it the programming language of choice for many data scientists. While these libraries operate efficiently on data sets that fit into RAM, they do not scale to data sets in the terabyte or petabyte range.

Enter Spark, a high-performance engine for big data processing. Spark provides a simple and flexible API for operating efficiently on large data sets. Written in Scala, it provides language bindings for Java, R and Python. It also supports higher-level tools for machine learning (MLlib), SQL processing (Spark SQL) and graph processing (GraphX).

This talk gives an introduction to the Spark computing system and shows how to use Spark with Python. Integration with the SciPy ecosystem will be demonstrated using IPython Notebook.
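
To give a flavour of what using Spark from Python can look like, here is a minimal word-count sketch against the RDD API. It is not taken from the talk. The helper names `tokenize` and `word_counts` are made up for illustration, and `sc` is assumed to be an existing `pyspark.SparkContext` (which requires the `pyspark` package and a Java runtime).

```python
import re


def tokenize(line):
    """Split a line into lowercase words (plain Python, no Spark needed)."""
    return re.findall(r"[a-z']+", line.lower())


def word_counts(sc, lines):
    """Count words with Spark's RDD API.

    `sc` is assumed to be a pyspark.SparkContext created by the caller;
    `lines` is an ordinary Python list of strings.
    """
    pairs = (
        sc.parallelize(lines)             # distribute the list as an RDD
        .flatMap(tokenize)                # one record per word
        .map(lambda word: (word, 1))      # pair each word with a count of 1
        .reduceByKey(lambda a, b: a + b)  # sum the counts per word
        .collect()                        # bring the results back to the driver
    )
    return dict(pairs)
```

With a local context created via `SparkContext("local[*]", "sketch")`, calling `word_counts(sc, ["Spark with Python", "Python and NumPy"])` would run the same pipeline on all local cores. The code would not need to change to run on a cluster; only the master URL would.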

8:00pm Lightning talks

There is room for two 5-minute lightning talks. They can be about anything related to data science and don't have to be submitted beforehand; just tell us that night.

8:10pm Networking with beer and pizza

Thorsten's Bio

My first contact with massively parallel systems was during my studies at Bergische Universität Wuppertal, where the physics department used Connection Machines CM-2 and CM-5 for quantum chromodynamics simulations. In my professional career I have been working with Java since 1997. Since 2014 my focus has been on big data, specializing in Apache Hadoop and data engineering.

Düsseldorf Data Science Meetup