Apache Spark and Amazon Workshop

Hosted By
Brian H.

Details

  1. Go to the lab page at https://qwiklabs.com/focuses/preview/1193?locale=en and select the lab.

  2. Create a new account

  3. Confirm the account creation via the email address you provided

  4. Enter the token for the lab

  5. Do not start the lab until the instructor says so.

My demo script is located here:

https://raw.githubusercontent.com/notjasonmorris/AWS/master/EMR/demo.sh

Attendees can download it to their local machine with: wget https://raw.githubusercontent.com/notjasonmorris/AWS/master/EMR/demo.sh

Agenda

4:30 - 5:00 - Networking and Refreshments

5:00 - 5:30 - Elastic MapReduce Presentation and Connection to EMR cluster

5:45 - 8:30 - Interactive Spark lecture and exercises

Overview

Please join us for a workshop on Apache Spark and the Amazon Elastic MapReduce (EMR) platform. This workshop was developed by Tetra Concepts (http://www.tetraconcepts.com/) and Amazon (http://aws.amazon.com/), and is sponsored by BAE Systems (http://www.baesystems.com/). It mixes brief lectures and demos with hands-on technical exercises in Scala and Spark. Each developer will be provided with an Amazon EMR cluster. The goal is a basic hands-on introduction to Spark and EMR, along with functional programming techniques.

Audience and Pre-requisites:

This workshop is intended for software developers who have a background developing in Java, Python, or Scala and are familiar with the MapReduce paradigm. No experience with Apache Spark is required. The brief lectures will introduce Amazon Elastic MapReduce and enough Scala to use the Spark Shell. The case studies and hands-on exercises will focus on using Spark to accelerate the traditional MapReduce design and build cycle.

To participate in the workshop, each developer is required to bring a laptop with a Unix/Linux operating system, Wi-Fi, and ssh access. The Unix/Linux system can be a virtual machine running within a Windows host.

IMPORTANT: Before the meetup, please install the Amazon command line interface (CLI) for Linux. This requires Python 2.6.5 or higher, and the CLI must be installed using pip: "pip install awscli". The installation instructions for the Amazon CLI can be found here: http://aws.amazon.com/cli/
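As a convenience, the laptop requirements above can be checked ahead of time with a short script. The following is only a sketch, not part of the workshop materials: the function name check_prerequisites is made up for illustration, the tool list (ssh, aws, wget) is taken from the requirements and the wget command mentioned on this page, and shutil.which assumes the script itself is run with Python 3.3 or newer.

```python
import shutil
import sys

# Minimum Python version stated in the workshop requirements.
MIN_PYTHON = (2, 6, 5)
# Command-line tools mentioned on this page (assumed to be on the PATH).
REQUIRED_TOOLS = ["ssh", "aws", "wget"]


def check_prerequisites(tools=REQUIRED_TOOLS, min_python=MIN_PYTHON):
    """Return a dict mapping each check name to True (found) or False (missing)."""
    results = {"python": sys.version_info[:3] >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on the PATH.
        results[tool] = shutil.which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print("%-8s %s" % (name, "ok" if ok else "MISSING"))
```

If "aws" is reported as missing, running "pip install awscli" as described above should resolve it.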

Course Outline:

Topics covered include:

• Installing Spark locally

• Deploying a Spark instance with Amazon's Elastic MapReduce

• Basic theory of Resilient Distributed Datasets (RDDs)

• Data exploration with Spark at the Spark Shell

• Using Spark's core APIs in Scala

• Using Spark's PairRDD functions

• Deploying a job on a Spark cluster

• How to access logs and diagnose a running job
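For attendees who want a refresher on the MapReduce paradigm before the workshop, the idea behind key-value reduction can be sketched in plain Python. This is not Spark code and does not use any Spark API: reduce_by_key is a hypothetical helper written only for this illustration, and the sample lines are made up.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, combine):
    """Group (key, value) pairs by key, then fold each group's values with `combine`."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(combine, values) for key, values in groups.items()}


# Made-up sample input for illustration.
lines = ["spark on emr", "spark shell"]

# "Map" step: emit a (word, 1) pair for every word in every line.
pairs = [(word, 1) for line in lines for word in line.split()]

# "Reduce" step: sum the counts that share the same word.
counts = reduce_by_key(pairs, lambda a, b: a + b)
```

Here counts maps each word to the number of times it appears, which is the classic word-count pattern that the hands-on exercises build on in Scala.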

Instructors

The lectures and problem sets will be presented by Dr. JT Halbert, Tetra Concepts' Chief Data Scientist. JT has over a decade of experience solving hard problems in various fields: orbital mechanics and control, nonlinear dynamics and chaos theory, cloud computing, and computer network defense. JT is passionate about helping people infer patterns, extract insight, and communicate these from the records of the observable world.

Jason Morris from Amazon will present Spark on Elastic MapReduce. Jason is a Solutions Architect for Amazon Web Services. He specializes in high performance computing, big data architecture, and GPU computing. Jason is also a certified instructor for the big data, systems architecture, and sysops classes and works closely with Amazon's EMR team. He has worked in the big data ecosystem for the last 6 years and has a total of 14 years of experience in the technology field.

Distributed Computing Maryland
The Hotel at Arundel Preserve
7795 Arundel Mills Boulevard · Hanover, MD