Big Data Workshop

Hosted by John V.


RSVPs are closed for this event. Another workshop event has been set up. Please RSVP on that event page. Many apologies for the inconvenience.

Folks on the WAITLIST: Please RSVP at the new event. I have a list of everyone on the waitlist to help you get an RSVP for the repeat event.


Our goal for this event is to have participants walk out of the workshop with information (e.g. files, code, documentation, processes, links to good resources) that can help a person begin to tackle big datasets / projects at their leisure.

Microsoft NERD has graciously provided event space for Sat. March 10th from 9-5. Many thanks! Please note there is no breakfast. The workshop will be in the Horace Mann room which is the large room on the first floor.

Programming Languages: The workshop is not a hackfest and is geared towards beginners and folks who know some Python or R. We realize Java is very important for Hadoop; however, it is not covered in any great detail here. Hopefully, a future Big Data Java event can be held in the Hadoop, Java, or Data Scientist groups. Similarly, SQL, which is hugely important for Business Intelligence, is only partially covered and will need a future event.

Preliminary Schedule: The 5-minute breaks are not listed; we will take them as folks desire.

  • Brief Overview/Acknowledgments

  • 9:30 to 11:30 Map/Reduce Tutorial - Vipin Sachdeva:

Processing a large enough file on a laptop will at some point crash the machine, given its memory limitations. Although we will run the big job in the cloud, running Hadoop locally can help with test and debug. Python is used; however, the general concepts apply across languages.
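To make the map/reduce idea concrete before the tutorial, here is a minimal sketch of a word count written as three plain Python steps. It is not the presenter's code: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) and the sample lines are made up for illustration, and everything runs in one process with no Hadoop involved.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all values by key, as a framework would do for us."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse one key's values into a single total."""
    return key, sum(values)


def word_count(lines):
    """Chain map -> shuffle -> reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input, standing in for a real file.
sample = ["big data big ideas", "Data beats ideas"]
counts = word_count(sample)
print(counts)
```

Because each step only sees one line or one key at a time, the same three functions describe what a cluster does in parallel; on a laptop they are simply a quick way to test the logic on a small sample.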

  • 11:30 to 12:00 Pizza provided by Microsoft and Think Big Analytics

  • 12:00 to 1:30 Cloud Computing - Jim O'Neil

Jim will cover all things cloud, including running the Python word count in the cloud using a GUI "Hadoop Streaming" interface. Many languages (Python, R, Ruby, etc.) can use Hadoop Streaming - we just happened to use Python here.
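For anyone who has not seen Hadoop Streaming before, the sketch below shows the general shape of a streaming word count. It is an assumption-laden illustration, not the workshop's actual code: Hadoop Streaming passes records as text lines with a tab between key and value, and hands the reducer its input sorted by key. Here `mapper`, `reducer`, and `run_locally` are hypothetical names, and `sorted()` stands in for the sort that Hadoop performs between the two steps. In a real job the mapper and reducer would typically be separate scripts reading standard input and printing to standard output.

```python
import itertools


def mapper(lines):
    """Turn raw text lines into 'word<TAB>1' records, one per word."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts for each word; input must already be sorted by key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in itertools.groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Local test harness: sorted() plays the role of Hadoop's sort phase."""
    return list(reducer(sorted(mapper(lines))))


# Hypothetical one-line input, handy for test and debug on a laptop.
result = run_locally(["to be or not to be"])
for record in result:
    print(record)
```

The reducer relies on identical keys arriving next to each other, which is why the sort matters; without it, `groupby` would emit the same word more than once.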

and Cloud Numerics C# Demo - Roope Astala

  • 1:30 to 2:15 Mortardata Demo (Pig/Python with R, Ruby, Perl, etc., to be supported in the future)- K Young

CBS Interactive and Web Analytics (Python/Hadoop) - Michael Sun

  • 2:15 to 4:00 VM, R-Studio, rmr - Jeffrey Breen

Another way into the cloud is directly through a virtual machine (VM). Jeffrey will guide us through the setup and then load R-Studio onto a node, which essentially gives one access to a bigger computer to run jobs. Jeffrey will then use the Hadoop-based 'rmr' package with an airline performance dataset.

  • 4:00 - 5:00 Post-Workshop Setup:

We realize that some folks may not have time during the week for software installations, or may encounter software problems, especially with the VM part. The thought here is to have a help session to assist folks in getting things set up before heading home.

Prizes: O'Reilly books to be given away at various points in the day, and a Kinect at the end.


  • The front of the room will have tables set up (about 100 seats).

  • The far back will have chairs and some tables (about 70 seats). If you plan to attend but not program much, please sit back here so folks with laptops can use the tables.

  • The percentage of RSVPs that show up is always an unknown. We guesstimate about 75%, which is why last Saturday we upped the waitlist from 150 [112] to 225 [170]. So there may be some standing room only (SRO); however, we've seen too many events with empty seats due to no-shows, hence the guesstimate of 75%.


  • Software Installation: Please note that if you only want to follow along, that is fine. Links to code and datasets will be provided as well - our presenters are creating decks, so please be patient.

  • Python

  • Hadoop: A local laptop install is for test and debug. Not required.

  • Cloud Account: Why? The thought was that post-workshop, many folks who want to analyze big data would likely need a cloud account. Running the project examples costs a handful of dollars.

The first example will be through Amazon's Elastic Map/Reduce. Similar in nature to:

  • Virtual Machine: Cloudera's Hadoop Demo VM

  • R-Studio: The rmr library will be used.

  • Getting R and Cloud together: r-bloggers has a post that discusses getting R-Studio connected. I tried it out and it worked fine. Tore Opsahl has instructions for R and R-Studio on EC2 that are a bit more detailed.

For additional sessions, we tossed around ideas. Needless to say, this is a beginner workshop, and we are just not there yet for advanced apps. Hopefully for workshop #2 we can get into some of these, e.g. Mahout for clustering, recsys.

A second poll question pertained to familiarity with programming languages: SQL, Python, Java, and R, more or less.

1 Memorial Drive · Cambridge, MA