Spark Workshop


Details
This 4-hour workshop introduces Apache Spark, the open-source cluster computing framework whose in-memory processing can make analytics applications up to 100 times faster than the disk-based technologies in wide deployment today. Highly versatile across many environments, and with a strong foundation in functional programming, Spark is known for its ease of use: exploratory code written at the REPL scales up to production-grade quality relatively quickly (REPL-driven development).
TechHub has graciously agreed to host us.
The plan is to start with a few publicly available datasets and gradually work our way through them until we extract some useful insights, gaining a deep understanding of Spark's rich collections API in the process.
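As a small taste of that API, here is a minimal PySpark sketch in the spirit of what we'll write; the dataset, file name, and column layout are purely hypothetical, not the actual workshop data:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "workshop-sketch")

    # Hypothetical input: one record per line, in the form "city,trip_minutes"
    lines = sc.textFile("trips.csv")

    # Average trip duration per city using the core RDD collections API:
    # map to (key, (value, count)) pairs, reduce, then take the mean
    averages = (lines
        .map(lambda line: line.split(","))
        .map(lambda fields: (fields[0], (float(fields[1]), 1)))
        .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
        .mapValues(lambda sum_count: sum_count[0] / sum_count[1]))

    print(averages.collect())

In the notebooks shipped with the workshop image, a SparkContext may already be available as sc, in which case the setup line can be skipped.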
We are also going to look at a very simple Spark Streaming example: a stream of integers with a rolling sum.
We'll first stream the data over a TCP socket (netcat), then through a Kafka topic (Apache Kafka).
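To set expectations, here is a minimal Scala sketch of the socket variant; treat it as an assumption of what the code might look like (host, port, and window sizes are illustrative), not the actual workshop code:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object RollingSum {
      def main(args: Array[String]): Unit = {
        // local[2]: one thread for the socket receiver, one for processing
        val conf = new SparkConf().setMaster("local[2]").setAppName("RollingSum")
        val ssc = new StreamingContext(conf, Seconds(1))

        // Newline-delimited integers arriving over TCP, e.g. from: nc -lk 9999
        val ints = ssc.socketTextStream("localhost", 9999).map(_.trim.toInt)

        // Rolling sum: total over the last 10 seconds, recomputed every 2 seconds
        ints.reduceByWindow(_ + _, Seconds(10), Seconds(2)).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

To try it, run nc -lk 9999 in one terminal, start the job, and type integers one per line. The Kafka variant keeps the same windowing logic and only swaps the socket input for a Kafka-backed stream.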
The workshop has a few requirements:
- Bring your own laptop.
- Have Docker already installed before the workshop.
- Have the Docker image already pulled and available locally.
Here are the necessary instructions:
- Install Docker
  Linux: curl -fsSL https://get.docker.com/ | sh
  Mac and Windows: https://www.docker.com/products/docker-toolbox
- Pull the workshop image: docker pull dserban/sparkworkshop
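To confirm the pull succeeded before you arrive, list your local images (the grep filter is just a convenience):

    docker images | grep sparkworkshop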
The batch processing code will be in Python, run in IPython Notebook (Jupyter).
The Spark Streaming code will be in Scala.
