PySpark Presented by Tim Hopper

Hosted By
Melinda T.

Details

Apache Spark is a next-generation cluster computing framework and data processing engine. By combining Spark's primitive operations in a functional style, the user can perform complex computations on large datasets. Though similar to Hadoop, Spark relies much more heavily on RAM, keeping intermediate data in memory instead of writing it to disk between steps, and has been shown to run up to 100x faster than Hadoop for some applications. This talk will introduce Spark in general and then show PySpark, the Python wrapper around core Spark, as a tool for rapid, interactive analytics as well as robust, production data pipelines. Finally, we will look at MLlib, Spark's distributed machine learning library.

Bio: Tim Hopper is a software engineer at Parse.ly, a web analytics startup. He has a master's degree in operations research from North Carolina State University.

Research Triangle Analysts