PySpark Presented by Tim Hopper


Details
Apache Spark is a next-generation cluster computing framework and data processing engine. By combining Spark's primitive operations in a functional style, the user can perform complex computations on large datasets. Though similar to Hadoop, Spark keeps far more of its working data in RAM (rather than writing intermediate results back to disk) and has been reported to run up to 100x faster than Hadoop for some applications. This talk will introduce Spark in general and then show PySpark, the Python wrapper around core Spark, as a tool for rapid, interactive analytics as well as robust production data pipelines. Finally, we will look at MLlib, Spark's distributed machine learning library.
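As a rough illustration of what "combining primitive operations in a functional style" can look like in PySpark, here is a minimal word-count sketch. It is not taken from the talk: the helper names (`tokenize`, `word_counts`, `main`), the sample lines, and the local two-thread master setting are all assumptions made for the example, and it presumes PySpark is installed.

```python
from operator import add


def tokenize(line):
    """Split a line of text into lowercase words."""
    return line.lower().split()


def word_counts(rdd):
    """Chain three RDD primitives: flatMap, map, and reduceByKey.

    Each step is a lazy transformation; nothing is computed until an
    action such as collect() is called on the result.
    """
    return (
        rdd.flatMap(tokenize)          # one record per word
           .map(lambda w: (w, 1))      # pair each word with a count of 1
           .reduceByKey(add)           # sum the counts for each word
    )


def main():
    # Imported here so the helpers above can be read without PySpark.
    from pyspark import SparkContext

    # "local[2]" runs Spark in-process on two threads (an assumption
    # for this sketch; a real cluster would use a different master URL).
    sc = SparkContext("local[2]", "wordcount-sketch")
    try:
        lines = sc.parallelize([
            "spark keeps data in memory",
            "spark composes simple operations",
        ])
        counts = word_counts(lines).cache()  # keep the result in RAM for reuse
        print(sorted(counts.collect()))
        print(counts.count())                # second action reuses the cached data
    finally:
        sc.stop()
```

Calling `main()` runs the pipeline locally. The same three-step chain works unchanged in the interactive `pyspark` shell, where a `SparkContext` is already available as `sc`.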
Bio: Tim Hopper is a software engineer at Parse.ly, a web analytics startup. He has a master's degree in operations research from North Carolina State University.
