The Spark team has been hard at work on two big features for release 0.7: PySpark, which adds a Python API to Spark, and an alpha release of Spark Streaming, which adds easy-to-use stream processing functionality. With Spark 0.7 coming out very soon, this meetup will introduce attendees to the new features. We're going to have two presenters:
1) Josh Rosen will show how to use the Python API. PySpark exposes almost all of Spark's features to Python programmers, both in standalone programs and from the python and IPython interactive shells. It works with the standard CPython interpreter, letting you use native libraries like NumPy and SciPy in your Spark programs. It also handles shipping functions to the cluster, just as Spark does in Java and Scala. We encourage you to invite your Python friends to learn about it!
2) Tathagata Das (TD) will cover Spark Streaming, a new extension of Spark for near-real-time stream processing that will be available as an alpha in Spark 0.7. We introduced Spark Streaming from a research perspective last summer, but this talk will show what the complete API looks like and discuss issues such as data input sources and fault tolerance. TD will also cover several applications, including a prototype implemented at Conviva that takes Conviva's Hadoop-based batch analytics pipeline (a series of MapReduce jobs that normally sees 5-10 minutes of latency) and runs the same Hadoop code on Spark Streaming with 2-second latency. This ability to run the same code in both batch and streaming settings is one of the reasons we're very excited about Spark Streaming.
Conviva graciously offered to host this meetup at its San Mateo office. Food will be provided. Doors open at 6:30, with talks starting at 7.