Spark v2.0 Workshop


Details
This 4-hour training course introduces Apache Spark v2.0, the open-source cluster-computing framework whose in-memory processing can make analytics applications up to 100 times faster than technologies in wide deployment today. Versatile across many environments, and with a strong foundation in functional programming, Spark is known for its ease of use: exploratory code written at the REPL scales up to production-grade quality relatively quickly (REPL-driven development).
The main focus will be on what is new in Spark v2.0: Datasets (compile-time type-safe DataFrames), Structured Streaming, and the de-emphasis of RDDs.
The plan is to start with a few publicly available datasets and gradually work our way through them until we extract some useful insights.
The Spark Streaming examples will be comprehensive. We'll start with a simple stream of integers (computing a rolling sum), then simulate a more full-fledged streaming architecture: first streaming data over a TCP socket (netcat), then via a Kafka topic (Apache Kafka).
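To give a feel for the rolling-sum idea before the workshop, here is a plain-Python sketch (hypothetical, not the workshop's actual code, which uses Spark Streaming in Scala): each integer arriving on the stream updates a running total, much like stateful streaming operators maintain state across micro-batches.

```python
from itertools import accumulate

# Stands in for a live stream of integers arriving one at a time.
stream = [3, 1, 4, 1, 5, 9]

# accumulate() yields the running (rolling) sum after each element.
rolling_sums = list(accumulate(stream))
print(rolling_sums)  # [3, 4, 8, 9, 14, 23]
```

In the workshop setting, the same per-element state update happens continuously over an unbounded stream rather than a finite list.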
We are also going to look at a very basic example of how to store a Spark DataFrame / DataSet in Cassandra using the open-source DataStax connector.
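As a taste of what that connector usage looks like, here is a hedged code fragment (not runnable on its own: it assumes an existing DataFrame `df`, a running Cassandra instance, and the spark-cassandra-connector package available to the Spark session; the keyspace and table names are made up for illustration):

```python
# Illustrative sketch only -- keyspace "demo" and table "events" are
# hypothetical; the format string is the DataStax connector's data source.
df.write \
  .format("org.apache.spark.sql.cassandra") \
  .options(table="events", keyspace="demo") \
  .mode("append") \
  .save()
```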
The workshop has a few requirements:
- Bring your own laptop.
- Have Docker already installed before the workshop.
- Have the Docker image already pulled and available locally.
Here are the necessary instructions:
- Install Docker
  Linux: curl -fsSL https://get.docker.com/ | sh
  Mac and Windows: https://www.docker.com/products/docker-toolbox
- Pull the image: docker pull dserban/sparkworkshop
The batch processing code will be in Python, using Jupyter (formerly IPython Notebook).
The Spark Streaming code will be in Scala.
