
Spark Workshop

Hosted By
Dan S.
Details

This 4-hour workshop introduces Apache Spark, the open-source cluster computing framework whose in-memory processing can make analytics applications up to 100 times faster than widely deployed alternatives. Versatile across many environments and grounded in functional programming, Spark is known for REPL-driven development: ease of writing exploratory code that scales up to production-grade quality relatively quickly.

TechHub has graciously agreed to host us.

The plan is to start with a few publicly available datasets and gradually work our way through them until we extract some useful insights, gaining a deep understanding of Spark's rich collections API in the process.
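To preview the collections-API style, here is the classic word-count pattern sketched with plain Python builtins; in Spark the same shape maps onto RDD operations (`flatMap`, `map`, `reduceByKey`). The sample lines are made up for illustration.

```python
lines = ["spark makes analytics fast",
         "spark scales from laptop to cluster"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n

print(counts["spark"])  # → 2
```

In Spark the same pipeline runs distributed across a cluster, but the mental model of chaining collection transformations is identical.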

We are also going to look at a very simple Spark Streaming example (stream of integers / rolling sum).
We'll first stream the data via a TCP socket (netcat), then via a Kafka topic (Apache Kafka).
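The core of the rolling-sum example can be sketched independently of Spark: integers arrive in micro-batches, and a piece of state (the running total) is carried from one batch to the next, as a stateful Spark Streaming job would do. The batch values below are made up for illustration.

```python
# Simulated micro-batches of integers arriving on the stream
batches = [[1, 2, 3], [4, 5], [6]]

running_total = 0   # state carried across micro-batches
totals = []         # rolling sum emitted after each batch

for batch in batches:
    running_total += sum(batch)
    totals.append(running_total)

print(totals)  # → [6, 15, 21]
```

The workshop's actual streaming code will be in Scala; this sketch only shows the state-update logic that the streaming job maintains per batch.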

The workshop has some requirements.

  1. Bring your own laptop.
  2. Have Docker already installed before the workshop.
  3. Have the Docker image already pulled and available locally.

Here are the necessary instructions:

  1. Install Docker
    Linux:
    curl -fsSL https://get.docker.com/ | sh
    Mac and Windows:
    https://www.docker.com/products/docker-toolbox

  2. docker pull dserban/sparkworkshop
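To confirm your setup is ready before the workshop, the standard Docker CLI checks below should all succeed (these are stock Docker commands, not workshop-specific):

```shell
# Confirm the Docker client is installed and the daemon is reachable
docker --version
docker run --rm hello-world

# Pull the workshop image ahead of time
docker pull dserban/sparkworkshop

# Verify the image is available locally
docker images | grep dserban/sparkworkshop
```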

The batch processing code will be in Python, in an IPython Notebook (Jupyter).

The Spark Streaming code will be in Scala.

The Bucharest Agile Software Meetup Group
Tech Hub
39-41 Nicolae Filipescu · Bucharest