Spark v2.0 Workshop


This 4-hour training course introduces Apache Spark v2.0, the open-source cluster computing framework whose in-memory processing can make analytics applications up to 100 times faster than the disk-based technologies in wide deployment today. Spark runs in many environments and has a strong foundation in functional programming. It is known for its ease of use: exploratory code written in the REPL can be scaled up to production-grade quality relatively quickly (REPL-driven development).

The main focus will be on what is new in Spark v2.0: Datasets (compile-time type-safe DataFrames), Structured Streaming, and the de-emphasizing of RDDs.

The plan is to start with a few publicly available datasets and gradually work our way through them until we extract some useful insights.

The Spark Streaming examples will be comprehensive. We'll start with a stream of integers (rolling sum), then simulate a more full-fledged streaming architecture: first we'll stream data over a TCP socket (netcat), then through an Apache Kafka topic.
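
To make the first streaming exercise concrete before Spark enters the picture, here is a minimal plain-Python sketch of the rolling-sum idea. It is not the workshop code: it assumes "rolling sum" means a running total over all integers seen so far, and it assumes the input arrives as one integer per text line, the way netcat would deliver it. The function names parse_lines and rolling_sum are made up for this illustration.

```python
def parse_lines(lines):
    """Turn raw text lines into integers, skipping blank or malformed lines.

    Assumption: one integer per line, as typed into a netcat session.
    """
    for line in lines:
        text = line.strip()
        if not text:
            continue
        try:
            yield int(text)
        except ValueError:
            # Ignore anything that is not an integer.
            continue


def rolling_sum(values):
    """Yield the running total after each value arrives."""
    total = 0
    for value in values:
        total += value
        yield total


# A list stands in for the live socket here; any iterable of lines works.
incoming = ["1\n", "2\n", "\n", "oops\n", "3\n"]
print(list(rolling_sum(parse_lines(incoming))))  # prints [1, 3, 6]
```

Because both functions are generators, each total is produced as soon as its input line arrives, which is the same shape the streaming version takes: a state (the total) that is updated per incoming record.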

We will also look at a very basic example of how to store a Spark DataFrame / Dataset in Cassandra using the open-source DataStax connector.
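
As a rough idea of what the Cassandra step can look like from Python, the sketch below wraps the write call in a small helper. It is an illustration, not the workshop's actual code: the helper name save_to_cassandra is invented here, and it assumes the DataStax connector is already available to the Spark session and that the target keyspace and table already exist.

```python
def save_to_cassandra(df, keyspace, table):
    """Append the rows of a Spark DataFrame to an existing Cassandra table.

    Assumptions: `df` is a Spark DataFrame, the DataStax connector is on the
    classpath, and the keyspace and table were created beforehand.
    """
    (
        df.write
        .format("org.apache.spark.sql.cassandra")  # data source provided by the connector
        .options(keyspace=keyspace, table=table)
        .mode("append")                            # add rows, do not replace the table
        .save()
    )
```

In a live session this would be called as save_to_cassandra(df, "workshop", "events"), where both names are placeholders. The Spark session also needs to know where Cassandra is running, which is typically set through the spark.cassandra.connection.host configuration property.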

The workshop has three requirements:

1. Bring your own laptop.

2. Have Docker already installed before the workshop.

3. Have the Docker image already pulled and available locally.

Here are the instructions for requirements 2 and 3:

2. Install Docker

Linux: curl -fsSL | sh

Mac and Windows:

3. Pull the Docker image: docker pull dserban/sparkworkshop

The batch processing code will be in Python, in an IPython Notebook (Jupyter).

The Spark Streaming code will be in Scala.