Big Data with Hadoop and Spark (6 weeks of Tues/Thur)


DETAILS
Sign up at Eventbrite (http://www.eventbrite.com/e/big-data-with-hadoop-and-spark-tickets-17994333536). Limited to 8 seats.
Dates:
August 11, 13, 18, 21, 25, 27, September 1, 3, 8, 10, 15, 17
(Twelve Classes, Tuesday and Thursday Nights)
Time:
7:00-9:30pm
Length of class: 30 hours
Instructor:
Sam Kamin is an Associate Professor Emeritus at the University of Illinois at Urbana-Champaign, where he taught computer science. Most recently he was an engineer at Google before joining NYC Data Science Academy as VP of Engineering.
Venue:
205 E 42nd Street, New York, NY 10017 (5 min from Grand Central)
Course Overview
An intensive, hands-on introduction to the Hadoop ecosystem of Big Data technologies.
The emphasis in this course is on learning several of the major components of Apache Hadoop (HDFS, MapReduce, Hive, Pig, Streaming) by doing exercises of increasing complexity. Programming will be done in Python.
Students are expected to be familiar with using an operating system from the command line; knowledge of Python is helpful; the material in <> is sufficient background knowledge.
The course format is mixed lecture/lab. Students will need to bring their own laptops to connect to our server; instructions will be provided ahead of time as to how to install any required software.
What is Hadoop?
Hadoop is an open-source framework for storing and processing large data sets in parallel across clusters of machines. Built on an implementation of the MapReduce programming model introduced by Google and on the Hadoop Distributed File System (HDFS), Hadoop offers scalability, flexibility, and fault tolerance. It is designed to handle massive quantities of structured, semi-structured, or unstructured data, which makes it well suited to Big Data.
As an Apache project, Hadoop is accompanied by a host of complementary Apache tools, such as Hive, Pig, and ZooKeeper, that further extend its applications and usability.
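To give a feel for the MapReduce model mentioned above, here is a minimal sketch in Python, the language used in the course. It is not Hadoop code: it runs the map, shuffle, and reduce steps in a single local process on a tiny word-count example, and the function names (map_words, shuffle, reduce_counts, word_count) are made up for illustration. On a real cluster, Hadoop would run many mappers and reducers in parallel and perform the shuffle itself.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group emitted values by key (done by the framework on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: combine all values for one key into a single total."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data with hadoop", "hadoop and spark", "big data"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

The same three-step structure carries over to larger jobs: only the map and reduce functions change, while the grouping by key stays the same.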
SYLLABUS
Week 1 – Introduction: MapReduce
Overview of Big Data and the Hadoop ecosystem
The concept of MapReduce
HDFS – Hadoop Distributed File System
MapReduce with Python streaming
Week 2 – More on MapReduce
More on Big Data, the Hadoop ecosystem, and MapReduce.
Mixed case studies and exercises using MR with Python streaming
Week 3 – Hive: A database for Big Data
Hive concepts
HiveQL
User-defined functions in the Hive language
User-defined functions in Python (using streaming)
Advanced topic: Hive queries in Python code
Week 4 – Pig: Simplified MapReduce
Basic concepts
Pig Latin
Pig functions and macros
User-defined functions
Week 5 – Spark
Intro to Spark
Intro to Mahout
Week 6 – Project day
The Hadoop ecosystem
Brief intro to Spark
Brief intro to Mahout
Case studies/Final projects
