Spark v2.0 Workshop


Details
This 4-hour training course introduces Apache Spark v2.0, the open-source cluster-computing framework whose in-memory processing can make analytics applications up to 100 times faster than technologies in wide deployment today. Versatile across many environments, and with a strong foundation in functional programming, Spark is known for its ease of use: exploratory code written at the REPL scales up to production-grade quality relatively quickly (REPL-driven development).
The main focus will be on what is new in Spark v2.0: Datasets (compile-time type-safe DataFrames), Structured Streaming, and the de-emphasis of RDDs.
The plan is to start with a few publicly available datasets and gradually work our way through them until we extract some useful insights.
The Spark Streaming examples will be comprehensive. We'll start with a simple stream of integers (computing a rolling sum), then simulate a more full-fledged streaming architecture: first streaming data over a TCP socket (netcat), then via a Kafka topic (Apache Kafka).
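To give a feel for the rolling-sum idea before the workshop, here is a plain-Python sketch (hypothetical, not the workshop's actual code, which uses Spark Streaming in Scala): each integer arriving on the stream updates a running total, much like stateful streaming operators maintain state across micro-batches.

```python
from itertools import accumulate

# Stands in for a live stream of integers arriving one at a time.
stream = [3, 1, 4, 1, 5, 9]

# accumulate() yields the running (rolling) sum after each element.
rolling_sums = list(accumulate(stream))
print(rolling_sums)  # [3, 4, 8, 9, 14, 23]
```

In the workshop setting, the same per-element state update happens continuously over an unbounded stream rather than a finite list.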
We are also going to look at a very basic example of how to store a Spark DataFrame / DataSet in Cassandra using the open-source DataStax connector.
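As a taste of what that connector usage looks like, here is a hedged code fragment (not runnable on its own: it assumes an existing DataFrame `df`, a running Cassandra instance, and the spark-cassandra-connector package available to the Spark session; the keyspace and table names are made up for illustration):

```python
# Illustrative sketch only -- keyspace "demo" and table "events" are
# hypothetical; the format string is the DataStax connector's data source.
df.write \
  .format("org.apache.spark.sql.cassandra") \
  .options(table="events", keyspace="demo") \
  .mode("append") \
  .save()
```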
The workshop has a few requirements:
- Bring your own laptop.
- Have Docker already installed before the workshop.
- Have the Docker image already pulled and available locally.
Here are the necessary instructions:
- Install Docker
  Linux: curl -fsSL https://get.docker.com/ | sh
  Mac and Windows: https://www.docker.com/products/docker-toolbox
- Pull the image: docker pull dserban/sparkworkshop
The batch processing code will be in Python, using Jupyter (formerly IPython Notebook).
The Spark Streaming code will be in Scala.
