Large Scale Machine Learning Workshop #3: Hive and Mahout


Details
Hello all! After getting good feedback from workshop #2, we’ll be focusing on hands-on tasks in workshop #3. We’ll be using Hive queries to create a table from a large-scale dataset. Once we have the data in a convenient form, we’ll be using Mahout to create a machine learning model.
We’re also fortunate to have Cloudera attending to give a presentation on Cloudera Director. This is the tool you’ll be using to bring up clusters on AWS machines, and is a follow-on from the Cloudera Manager presentation in workshop #2.
PLEASE ARRIVE AT THE MEETUP WITH YOUR CLUSTER RUNNING. The pre-work has all the steps you need to spin up your cluster. The cluster only costs $1 per hour, so you can start it ahead of time and leave it running until the workshop.
Agenda
- Cloudera will present details of Cloudera Director.
- Omar will give a brief introduction to the tools we’re using in the meetup: Hive and Mahout.
- Jaya will give a hands-on tutorial on using Hive for ETL of data.
- Omar will give a hands-on tutorial on using Mahout to train a model on the Hive tables.
- Finally, we’ll have time for questions and wrap up the session.
Pre-work
The pre-work for this week can be found in the Week3 folder of the Hadoop Dropbox folder (http://bit.ly/acmawshadoop). In the pre-work you’ll spin up a cluster using Cloudera Director and check Cloudera Manager. The pre-work takes 90 minutes, but don’t be scared! Only about 20 minutes of hands-on work is needed; the other 70 minutes are spent downloading and installing software on the cluster.
Getting there
The logistics for the meetup can be found here: http://bit.ly/acmawshadoopinfo (this includes directions to the AWS office and background information).
Any other questions?
If you have any other questions, please leave a comment below.
