Spark overview and PySpark demo

Details

What is Apache Spark? (http://spark.incubator.apache.org/)

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.
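As a rough illustration of why keeping data in memory helps, here is a toy, pure-Python sketch. It is not Spark's API and it does nothing distributed: the `LazyDataset` class, the `read_log` function and the sample log lines are all invented for this example. It only shows the general idea of recording transformations lazily, running them when a result is requested, and reusing a cached intermediate result instead of re-reading the source.

```python
class LazyDataset:
    """Toy model (hypothetical, not Spark): transformations are recorded, not run."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument callable that yields records
        self._cache_enabled = False    # set by cache()
        self._cached = None            # materialized records after first evaluation

    def _iterate(self):
        # Serve from memory if this dataset has already been materialized.
        if self._cached is not None:
            return iter(self._cached)
        if self._cache_enabled:
            self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()

    def map(self, fn):
        # Returns a new dataset; nothing is computed yet.
        return LazyDataset(lambda: (fn(x) for x in self._iterate()))

    def filter(self, pred):
        return LazyDataset(lambda: (x for x in self._iterate() if pred(x)))

    def cache(self):
        # Ask for the result to be kept in memory once it is first computed.
        self._cache_enabled = True
        return self

    def collect(self):
        # An "action": this is what actually triggers the computation.
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


reads = {"n": 0}  # counts how often the (pretend) slow source is read


def read_log():
    # Stand-in for an expensive read from disk; the lines are made-up sample data.
    reads["n"] += 1
    return iter(["INFO start", "ERROR disk full", "INFO ok", "ERROR timeout"])


errors = LazyDataset(read_log).filter(lambda line: line.startswith("ERROR")).cache()

n_errors = errors.count()  # first action: reads the source and fills the cache
messages = errors.map(lambda line: line.split(" ", 1)[1]).collect()  # reuses the cache

print(n_errors, messages, reads["n"])
```

In this sketch the second query never touches `read_log` again, which is the kind of saving an in-memory engine aims for when the same dataset is queried repeatedly.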

To make programming faster, Spark provides clean, concise APIs in Python (http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-job-in-python), Scala (http://www.scala-lang.org) and Java (http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-job-in-java). You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Josh Rosen from the UC Berkeley AMPLab will give a big-picture overview of Spark, coupled with a live demo of PySpark on an EC2 cluster. The demo will be based on a tutorial that Fernando Perez wrote at AMP Camp on accessing PySpark through the IPython notebook (http://nbviewer.ipython.org/6384491/00-Setup-IPython-PySpark.ipynb).

See you there!

Code and Data
Twilio HQ
645 Harrison St. 3rd Floor · San Francisco, CA