Big Data with Hadoop and Spark (6 weeks of Tues/Thur)

Hosted By
Vivian Z.

Details

Sign up at Eventbrite (http://www.eventbrite.com/e/big-data-with-hadoop-and-spark-tickets-17994333536). Limited to 8 seats.

Dates:
August 11, 13, 18, 21, 25, 27, September 1, 3, 8, 10, 15, 17
(Twelve Classes, Tuesday and Thursday Nights)

Time:
7:00-9:30pm

Length of class: 30 hours

Instructor:
Sam Kamin is Associate Professor Emeritus at the University of Illinois at Urbana-Champaign,
where he taught computer science. Most recently he was an engineer at Google before joining NYC Data Science
Academy as VP of Engineering.

Venue:
205 E 42nd Street, New York, NY 10017 (5 min from Grand Central)

Course Overview

An intensive, hands-on introduction to the Hadoop ecosystem of Big Data technologies.

This course emphasizes learning several major components of Apache Hadoop (HDFS, MapReduce, Hive, Pig, Streaming) through exercises of increasing complexity. Programming will be done in Python.

Students are expected to be comfortable working at the operating-system command line. Knowledge of Python is helpful; the material in <> is sufficient background knowledge.

The course format is mixed lecture/lab. Students will need to bring their own laptops to connect to our server; instructions will be provided ahead of time as to how to install any required software.

What is Hadoop?

Hadoop is an open-source framework for storing and processing large data sets in parallel across clusters of machines. Built on MapReduce, a programming model introduced by Google, and the Hadoop Distributed File System (HDFS), Hadoop offers scalability, flexibility and fault tolerance. It is designed to handle massive quantities of structured, semi-structured, or unstructured data, which makes it well suited to Big Data.

As part of the Apache ecosystem, Hadoop is complemented by a host of related Apache projects, such as Hive, Pig and ZooKeeper, that further extend its applications and usability.
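
To give a feel for the MapReduce model mentioned above, here is a minimal sketch of a word count written in the style of a Python streaming job. It is not course material: the function names (mapper, reducer, word_count) and the sample input are illustrative assumptions, and everything runs in a single local process, with a plain sort standing in for the shuffle step that Hadoop performs between the map and reduce phases.

```python
from itertools import groupby


def mapper(lines):
    """Map phase: emit one tab-separated "word<TAB>1" record per word."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Reduce phase: sum the counts for each run of identical keys.

    Assumes the records arrive sorted by key, as they would after
    the shuffle-and-sort step of a real job.
    """
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


def word_count(lines):
    """Run map, shuffle (a local sort), and reduce in one process."""
    mapped = mapper(lines)
    shuffled = sorted(mapped)  # local stand-in for Hadoop's shuffle and sort
    return dict(reducer(shuffled))


if __name__ == "__main__":
    # Hypothetical sample input, chosen only for illustration.
    sample = ["big data with hadoop", "hadoop and spark", "big data"]
    for word, count in sorted(word_count(sample).items()):
        print(f"{word}\t{count}")
```

On a cluster, the mapper and reducer would typically be separate scripts reading from standard input and writing to standard output, with Hadoop distributing the input and grouping the keys between them; the local version only illustrates the data flow.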

SYLLABUS

Week 1 – Introduction: MapReduce

Overview of Big Data and the Hadoop ecosystem
The concept of MapReduce
HDFS – Hadoop Distributed File System
MapReduce with Python streaming

Week 2 – More on MapReduce

More on Big Data, the Hadoop ecosystem, and MapReduce.
Mixed case studies and exercises using MR with Python streaming

Week 3 – Hive: A database for Big Data

Hive concepts
HiveQL
User-defined functions in the Hive language
User-defined functions in Python (using streaming)
Advanced topic: Hive queries in Python code

Week 4 – Pig: Simplified MapReduce

Basic concepts
Pig Latin
Pig functions and macros
User-defined functions

Week 5 – Spark

Intro to Spark
Intro to Mahout

Week 6 – Project day

The Hadoop ecosystem
Brief intro to Spark
Brief intro to Mahout
Case studies/Final projects

NYC Data Science Academy
205 East 42nd Street, 19th floor, New York, NY