Participants will learn to adapt and execute machine learning algorithms in the map reduce framework. Participants should finish the class able to author their own machine learning algorithms for map reduce and to run them on Amazon Web Services.
Participants will learn to write mappers and reducers in python for “hadoop-streaming”. For most of the class we will use “mrjob”, an open-source framework developed at Yelp. With mrjob, class members program mappers and reducers in python, and the framework then submits the job to run locally without hadoop, on Amazon Web Services, or on a private hadoop cluster. This simplifies the programming tasks.
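To give a feel for the mapper and reducer pattern before the first session, here is a minimal word-count sketch in plain python. It uses only the standard library and is not mrjob's API: the run_local function is a hypothetical stand-in for the work a framework does between the two phases (grouping the mapper output by key), and the two input lines are made-up sample data.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Sum all the counts that were emitted for a single word."""
    yield word, sum(counts)


def run_local(lines):
    """Hypothetical local driver: map, group by key (the shuffle), then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)

    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


# Made-up sample input, one string per "line" of a data file.
word_counts = run_local(["map reduce map", "reduce the data"])
```

The mapper sees one record at a time and the reducer sees all values for one key, so neither needs to hold the whole data set in memory. That constraint is what lets the same two functions run unchanged on a single machine or on a cluster.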
Registration covers the cost of all 4 sessions. If you register at least 5 days before the class, the price is $325; if you register in the last 5 days, the price is $375. You can register by credit card at http://machinelearningbigdata.eventbrite.com, or you can pay by check or cash at the first class meeting. You can also use paypal (mike at mbowles dot com).
The class will be delivered by webcast, since several people usually want to attend remotely. To take the class by webcast, you'll need to register on http://machinelearningbigdata.eventbrite.com at least 24 hours before class starts.
Here's a schedule to give an idea of what we intend to cover. We can modify the schedule to match class interests - replace one of the algorithms with another or cover more algorithms at less depth etc. We'll discuss the topics at the first class meeting.
Week 1 Implementing Algorithms on Big Data - MrJob installation
MapReduce, Hadoop Streaming, Mahout, Amazon (AWS, EMR)
Week 2 Clustering
k-means, Canopy Clustering
Week 3 Supervised Learning
EM algo for mixture model, using canopy for speedup
Week 4 Other ML Tasks
Regularized Regression - glmnet algo for elasticnet
SVM - Pegasos algo for two-class and one-class, extensions
Recommender Engine - Matrix Factorization by Gradient Descent
Other topics Decision Trees - Google PLANET, Text Mining, Ensemble Methods
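As an illustration of how an algorithm from the schedule fits the map reduce pattern, here is a sketch of a single k-means iteration in plain python. It is an assumption-laden toy, not the class material: the mapper assigns each point to its nearest centroid, the reducer averages the points in each cluster, and the hypothetical kmeans_step function plays the role of the framework. The points and starting centroids are made-up data, and a real job would also need a way to pass the current centroids to every mapper.

```python
from collections import defaultdict


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mapper(point, centroids):
    """Emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)),
                  key=lambda i: squared_distance(point, centroids[i]))
    yield nearest, (point, 1)


def reducer(cluster_id, values):
    """Average the points assigned to one cluster to get its new centroid."""
    total = None
    count = 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [t + x for t, x in zip(total, point)]
        count += n
    yield cluster_id, tuple(t / float(count) for t in total)


def kmeans_step(points, centroids):
    """Hypothetical local driver for one map, shuffle, reduce pass."""
    grouped = defaultdict(list)
    for point in points:
        for key, value in mapper(point, centroids):
            grouped[key].append(value)

    # A centroid that attracted no points keeps its old position.
    new_centroids = list(centroids)
    for key in sorted(grouped):
        for cluster_id, centroid in reducer(key, grouped[key]):
            new_centroids[cluster_id] = centroid
    return new_centroids


# Made-up sample data: two obvious groups and two rough starting guesses.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = kmeans_step(points, centroids)
```

Running kmeans_step repeatedly, feeding each result back in as the next set of centroids, gives the usual iterative algorithm. Each pass is one map reduce job.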
-Facility with undergrad level math and stats (vector calculus, density functions, etc.)
-Comfort with basic python programming (version 2.6 or 2.7, NOT version 3).
-You'll also need to develop some familiarity with Numpy - ("random" family of functions, matrix(), array())
-Install mrjob and boto (both are python packages)
-Familiarity with basic machine learning.