Course datasets

Downloading the dataset

The airline delay dataset can be found here. The site is slow, so please use this link for a faster download.
You can also use the enriched dataset provided by Raj.


Preparing the dataset for R

Take a look at a data sample

Before we start running a big RHadoop task, let's copy the first 1000 lines of one year's data into a small CSV file:
bzcat 2004.csv.bz2 | head -1000 > 2004-1000.csv
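To check that the sample looks sensible before going further, a small helper along these lines can print the line count and the column names. This is only a sketch: the function name preview_csv is made up for illustration, and it assumes the first line of the file is a header and that no field contains a quoted comma.

```shell
# Summarize a CSV sample: total number of lines, then one column name per line.
# Assumes the first line is a header and fields contain no quoted commas.
preview_csv() {
    file=$1
    # wc -l may pad its output with spaces on some systems, so strip them
    echo "lines: $(wc -l < "$file" | tr -d ' ')"
    # Split the header on commas so each column name lands on its own line
    head -n 1 "$file" | tr ',' '\n'
}

# Example, run after creating the sample:
# preview_csv 2004-1000.csv
```

If the line count is not what you expect, the download or the bzcat step was probably interrupted.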

Loading the data into HDFS
First, make a few directories:
hadoop fs -mkdir /user/cloudera
hadoop fs -mkdir asa-airline
hadoop fs -mkdir asa-airline/data
hadoop fs -mkdir asa-airline/out
Now load the sample:
hadoop fs -put 2004-1000.csv asa-airline/data/

If you want to load all the data, decompress the files and put them in the asa-airline/data/ directory.
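One possible way to do this for every yearly file is sketched below. Note that it is a variation on the step described above: instead of decompressing to local disk first, it streams each file straight into HDFS with hadoop fs -put -, which reads from standard input. The function name upload_all and the DRY_RUN switch are made up for illustration, and the sketch assumes the downloaded files sit in one directory and are named like 2004.csv.bz2.

```shell
# Decompress every *.csv.bz2 file in a directory and upload it to HDFS.
# Set DRY_RUN=1 to print the commands instead of running them.
upload_all() {
    src_dir=$1   # local directory holding the .csv.bz2 files
    dest=$2      # HDFS target directory, e.g. asa-airline/data
    for f in "$src_dir"/*.csv.bz2; do
        # Skip the unexpanded pattern when the directory has no matching files
        [ -e "$f" ] || continue
        # 2004.csv.bz2 becomes 2004.csv
        name=$(basename "$f" .bz2)
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "bzcat $f | hadoop fs -put - $dest/$name"
        else
            # "-put -" makes hadoop read the file contents from stdin
            bzcat "$f" | hadoop fs -put - "$dest/$name"
        fi
    done
}

# Example: preview the commands first, then run them for real.
# DRY_RUN=1 upload_all . asa-airline/data
# upload_all . asa-airline/data
```

Streaming avoids needing local disk space for the uncompressed files. If you prefer to keep local copies, decompress with bunzip2 and use hadoop fs -put on the resulting .csv files instead.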









