Best Practices for PySpark, with Juliet Hougland of Cloudera

Hosted By
John M.

Details

We are thrilled to welcome Juliet Hougland (https://twitter.com/j_houg) to the NYC Data Science meetup. Juliet will be talking about the Python API for Apache Spark, known as PySpark, and best practices for its use.

Note: The RSVP list for this talk will close at 10:00am on the day of the event.

Summary: PySpark (the component of Spark that allows users to write their code in Python) has grabbed the attention of Python programmers who analyze and process data for a living. The appeal is obvious: you don't need to learn a new language, and you still have access to modules (pandas, nltk, statsmodels, etc.) that you are familiar with, but you are able to run complex computations quickly and at scale using the power of Spark.

In this talk, we will examine a real PySpark job that runs a statistical analysis of time series data, both to motivate the issues described above and to provide a concrete example of best practices for real-world PySpark applications. We will cover:

• Python package management on a cluster using virtualenv.

• Testing PySpark applications.

• Spark's computational model and its relationship to how you structure your code.

Bio: Juliet is a Data Scientist at Cloudera, and contributor/committer/maintainer for the Sparkling Pandas project. Her commercial applications of data science include developing predictive maintenance models for oil & gas pipelines at Deep Signal, and designing and building a platform for real-time model application, data storage, and model building at WibiData. Juliet was the technical editor for Learning Spark by Karau et al. and Advanced Analytics with Spark by Ryza et al. She holds an MS in Applied Mathematics from the University of Colorado, Boulder and graduated Phi Beta Kappa from Reed College with a BA in Math-Physics.

Thanks to Cloudera (http://www.cloudera.com/content/cloudera/en/home.html) for sponsoring food and drink for this meetup, and to Work-Bench (http://www.work-bench.com/) for providing meeting space.

NYC Data Science
Work-Bench
110 Fifth Avenue, 5th Floor · New York, NY