Big Data Workshop

RSVPs are closed for this event.  Another workshop event has been set up. Please RSVP on that event page.  Many apologies for the inconvenience.

http://www.meetup.com/Boston-Predictive-Ana...

Folks on the WAITLIST:  Please RSVP at the new event. I have a list of everyone on the waitlist to help you get an RSVP for the repeat event.

--------------------------------------------------------------------

Our goal for this event is to have participants walk out of the workshop with information (e.g. files, code, documentation, processes, links to good resources) that can help them begin to tackle big datasets and projects at their leisure.

 

Microsoft NERD has graciously provided event space for Sat. March 10th from 9-5. Many thanks!  Please note there is no breakfast.  The workshop will be in the Horace Mann room which is the large room on the first floor.

 

Programming Languages:  The workshop is not a hackfest and is geared towards beginners and folks who know some Python or R. We realize Java is very important for Hadoop; however, it is not covered in any great detail here. Hopefully, a future Big Data Java event can be held in the Hadoop, Java, or Data Scientist groups.  Similarly, SQL, which is hugely important for Business Intelligence, is only partially covered and will need a future event.

 

Preliminary Schedule: Five-minute breaks are not listed; we will take them as folks desire.


- Brief Overview/Acknowledgments

- 9:30 -11:30 Map/Reduce Tutorial - Vipin Sachdeva:

Processing a large enough file on a laptop will at some point crash the machine because of memory limitations.   Although we will run the big job in the cloud, running Hadoop locally can help with test and debug.  Python is used; however, the general concepts apply across languages.

- 11:30 to 12:00  Pizza provided by Microsoft and Think Big Analytics

- 12:00 -1:30 Cloud Computing - Jim O'Neil

Jim will cover all things cloud, including running the Python word count in the cloud using a GUI "Hadoop Streaming" interface.  Many languages (Python, R, Ruby, etc) can use Hadoop Streaming - we just happened here to use Python.

and Cloud Numerics C# Demo - Roope Astala

http://blogs.msdn.com/b/cloudnumerics/archive/2012/02/07/cloud-numerics-example-analyzing-demographics-data-from-windows-azure-marketplace.aspx

 

- 1:30 to 2:15  Mortardata Demo (Pig/Python; R, Ruby, Perl, etc. to be supported in the future) - K Young

http://mortardata.com/

 

CBS Interactive and Web Analytics (Python/Hadoop) - Michael Sun

 

 

- 2:15 to 4:00   VM, R-Studio, rmr - Jeffrey Breen

Another way into the cloud is directly through a virtual machine (VM).  Jeffrey will guide us through the setup and then load R-Studio onto a node, which essentially gives one access to a bigger computer to run jobs.  Jeffrey will then use the Hadoop-based 'rmr' package with an airline performance dataset.

 

- 4:00 - 5:00  Post-Workshop Setup:  

 

We realize that some folks may not have time during the week for software installations, or may encounter software problems, especially with the VM part.  The thought here is to have a help session to assist folks in getting things set up before heading home.
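
The Python word count referenced in the Map/Reduce and Cloud Computing sessions follows the usual map, sort, reduce pattern. The sketch below is not the presenters' code. It is a minimal illustration, and the function names (mapper, reducer, run_locally) and the sample text are assumptions made for this example. With Hadoop Streaming, the map and reduce steps would typically live in two separate scripts that read lines from standard input and write tab-separated key/value lines to standard output; here both steps run in one process so the logic can be tested and debugged on a laptop.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map step: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce step: sum the counts for each word.

    The input must already be sorted by word, which is what the
    shuffle/sort phase between map and reduce provides.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_locally(lines):
    """Simulate map -> sort -> reduce in a single process (hypothetical helper)."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    # Made-up sample input; a real job would read a large text file instead.
    sample = ["the quick brown fox", "The lazy dog and the fox"]
    for word, total in sorted(run_locally(sample).items()):
        print("%s\t%d" % (word, total))
```

Keeping the mapper and reducer as plain generator functions means the same logic can be checked on a small file first and only then submitted as a larger job.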

 

Prizes:  O'Reilly books to be given away at various points in the day, and a Kinect at the end. 

 

 

Seating:

- the front of the room will have tables set up (about 100 seats)

- the far back will have chairs and some tables (about 70 seats).  If you plan to attend but not program much, then please sit back here so folks with laptops can use the tables.

- the percentage of RSVPs that show up is always an unknown.  We guesstimate about 75%, which is why last Saturday we upped the waitlist from 150 [112] to 225 [170].  So there may be some SRO; however, we've seen too many events with empty seats due to no-shows, and thus the guesstimate of 75%.

 

 

Pre-Workshop:

- Software Installation:  Please note that if you just want to follow along, that is fine.  Links to code and datasets will be provided as well. Our presenters are creating decks, so please be patient.

 

- Python

- Hadoop: A local laptop install is for test and debug.  Not required.

- Cloud Account:  Why?  The thought was that post-workshop many folks who want to analyze big data would likely need a cloud account. Running the project examples should cost only a handful of dollars.

http://aws.amazon.com/

The first example will be through Amazon's Elastic Map/Reduce.  Similar in nature to:

http://www.youtube.com/watch?v=kNsS9aDf6uE

 

- Virtual Machine from Cloudera

https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM

- R-Studio:  A library which will be used is rmr.  

https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr

- Getting R and Cloud together:  r-bloggers has a post that discusses getting R-Studio connected. I tried it out and it worked fine.  Tore Opsahl has instructions for R and R-Studio on EC2 that are a bit more detailed.

http://www.r-bloggers.com/rstudio-in-the-cloud-for-dummies/

http://toreopsahl.com/2011/10/17/securely-using-r-and-rstudio-on-amazons-ec2/

For additional sessions, we tossed around ideas. Needless to say, this is a beginner workshop, and we are just not there yet for advanced apps.  Hopefully for workshop #2 we can get into some of these, e.g. Mahout for clustering, recsys.

 

http://www.meetup.com/Boston-Predictive-Analytics/polls/452732/

 

A second poll question pertained to familiarity with programming languages:  SQL, Python, Java, and R, more or less.

 

http://www.meetup.com/Boston-Predictive-Analytics/polls/452772/

 


  • Dag H.

    Very good, very good information, good talks, good questions and answers, good organization.

    March 12, 2012

  • John M.

    Many thanks to the organizers and sponsors. It was a great event
    and I learned a lot.
    Thanks especially to Vipin for the presentation on stand-alone hadoop
    and Jeffrey for all things hadoop in the cloud.

    - john

    March 12, 2012

  • Becky

    Really appreciate the planning and preparation that went into this--clearly, the organizers and speakers put a lot of thought into the agenda and presentations. Technical presenters struck a good balance between giving enough technical background for context and keeping things general enough for a beginning workshop. I found the day extremely helpful and a great starting point for hands-on Hadoop.

    March 12, 2012

  • David C.

    Great workshop--all the presenters were excellent.

    Vipin: to install hadoop for dev/test on a Ubuntu machine, the best instructions I found are at:

    http://www.michael-noll.com/tut...

    March 12, 2012

  • Das S.

    Great to see folks sharing their experiences and teaching. Kudos to the organizers for successfully bringing such a large event together.

    March 11, 2012

  • Tom H.

    I really appreciate the collective effort of John, the speakers and Microsoft to put this together. I found it very informative and really heartening to see everyone interested and pulling together to educate each other. A great day.

    March 11, 2012

  • Rohit K.

    Very enlightening and hands on. Can't wait to participate in the next one!

    March 11, 2012

  • Carlos A B.

    It was a true workshop. Great introduction to map-reduce, Hadoop and AWS.

    March 11, 2012

  • Ben O.

    Very well done. Kept material at a consistent level throughout. Next time maybe have a couple of separate tracks- i.e. R or Python- not both in same crowd.

    March 11, 2012

  • Luiz F.

    Fantastic! Very practical and hands-on. Walked away with a lot of new knowledge. Looking forward to the next session.

    March 11, 2012

  • A former member

    Some presentations very basic (but then that was the goal of the workshop). The others were good introductions to various big-data techniques and quite well presented. Kudos for a great effort. Hope to see more.

    March 11, 2012

  • Lynn C.

    very good but perhaps too many talks -- although they were all excellent.

    March 11, 2012

  • John V.

    Many thanks to the presenters, attendees, and sponsors! Met some great folks and learned some cool new things about Big Data!

    March 11, 2012

  • Kent J.

    A big thank you to the organizers and presenters of this event! The content was excellent and took a lot of the mystery out of using Hadoop.

    March 11, 2012

  • Peter

    Very well organized and presented. Really appreciated the work that went on beforehand to get people set up with software, and the high quality of the presenters and coherence/relevance of the topics. Audience was well-behaved, too.

    March 11, 2012

  • Bill W.

    I've run Virtual PC on my Windows desktop, so pretty sure that's fine. And I'm not taking the desktop to the session... so I'll focus on the MacBook Pro (which I am a total rookie with). I don't have any more cycles for tonight - I'll seek help tomorrow. THIS IS WHY HADOOP as a SERVICE IS SUCH A GOOD IDEA. :-)

    March 9, 2012

  • John V.

    Any error message? Is virtualization enabled in the BIOS? We plan a help session at the end - also perhaps before or during the breaks help can be provided. For the help session, I plan to ask audience members who have it working to also help out.

    March 9, 2012

  • Sonia F.

    I hope it's not too late for someone to jump from the waiting list. I'm changing my RSVP. Didn't have time to prepare at least a bit for the talk and this huge waiting list is giving me some pressure. I hope you guys can run this workshop again.

    March 9, 2012

  • Bill W.

    I (believe I) followed the instructions to install a VM host (I chose VirtualBox since it works on both Windows and Mac) and I got it to (apparently) install fine on both, built new VMs using the Cloudera virtual drive (from here https://ccp.cloudera.com/display..., per software-installation.ppt), and on both systems (8 GB 64-bit Win 7, 4 GB 64-bit Mac, at least 2 GB RAM allocated to the VM) the resultant VM hangs during boot. Anyone else have any luck on it?

    March 9, 2012

  • Vipin S.

    Try the following link: https://s3.amazonaws.com/com.had...

    March 9, 2012

  • John V.

    Along those lines, I wonder is there a free file sharing service like DropBox out there that might work for a file that big?

    March 9, 2012

  • Gary L.

    The scripts work fine with Python 2.7.2 on Win 7. Could a link be provided to a compressed version of the Very Large input file, all.txt? Using 7-zip on the large, input2.txt, reduced it from 76 MB to under 21 MB. A similar reduction of the very large file would be a big help.

    March 9, 2012

  • Vipin S.

    My Macbook actually has 2.6.1 and the scripts work fine on my laptop. The python scripts do not really use any of the new features of Python.

    March 9, 2012

  • Bill W.

    Which version(s) of Python will work? I see an "old school" 2.7.2 and new fangled (and not entirely back-compat) 3.2.2 out. My laptop has 2.7.1 on it. Do I need to upgrade?

    March 9, 2012

  • John V.

    Hi Bill, seems like the map/reduce programs would work, and similarly the VM and R section. The cloud streaming job would appear not applicable since AWS will be used. Plans are to get the code and datasets up on a wiki or the like for folks to access.

    March 3, 2012

  • John V.

    Dan, the original R can be used. I will add a link for additional instructions. 'rmr' is the main package that will be used for R/Hadoop. So for Python, we are using it locally and through AWS. I wondered about Python via a VM similar to what Jeffrey is doing for R. We don't have that in the workflow, though it is worth following up on to provide folks with that alternative if it makes sense.

    March 3, 2012

  • Bill W.

    I have access to the Hadoop on Azure environment. Can you think of any reason to not use that for the hands on part?

    March 3, 2012

  • Dan R

    Sounds like much good work is well on its way to a great workshop. Thanks to all. Additional software details in advance would help, since 255 people installing packages over wifi during the workshop could be troublesome. Relevant questions:
    Should we actually install R and R-Studio, or will we be using R-Studio Server on a server? If we install R, what packages other than R-base are suggested? For Python, which modules beyond the Standard Library are recommended? Git? SSH? -- Thanks

    March 2, 2012

  • A former member

    Great..!!. Thanks John for the information .

    March 2, 2012

  • John V.

    No. Two packages to be used are Python and R-Studio. These are open source and both have good websites for downloading. The morning session will use Python; afternoon R.
    The cloud software involves more instruction and explanation. I have a general workshop email to send to cover the agenda/schedule, seating, software, waitlist, food, parking, et al. The very good news is that the presenters have successfully run their programs, and have been in the very time-consuming phase of creating decks.

    March 2, 2012

  • A former member

    Did you guys send the software installation document? Just wanted to confirm as I did not receive any. Thanks.

    March 2, 2012

  • John V.

    The waitlist is set up so that when a person changes an RSVP to not attending, the next person in the waitlist queue is automatically added to the event. The waitlist includes a timestamp. At the moment there are about 20 or so folks ahead of you in the queue. We are planning a workshop part II, as well as repeating sessions as evening events. Agree that folks who have a change in plans should update, and will send out reminder emails beginning on the weekend.

    March 1, 2012

  • Nik

    Did the wait list move at all? If yes, how do I know if I am confirmed or not ? Need to make some travel arrangements and hence, the question. Also, requesting other members to please cancel in case they don't plan to come or their 'guests' don't plan to come.

    March 1, 2012

  • John V.

    MIT has a couple of parking lots that are free on Saturdays. One lot is on Hayward Street, the other close by. From there it's about a 5 minute walk to Microsoft NERD. Will post more parking info once I can find an MIT parking map.

    February 25, 2012

  • Bill H.

    According to the site's information, there is a parking garage on the site just past the entrance. Does anyone know whether it will be open on Saturday or what its hours are? I couldn't find them.

    February 25, 2012

  • Chandra P.

    Is there a parking facility near by?

    February 25, 2012

  • John V.

    Hi Remy, just rsvp for the event and meetup will add you to the waitlist.

    February 15, 2012

  • Remy F.

    How can I add myself to the waiting list?

    February 15, 2012

  • Douglas M.

    A meta list of data set lists and data sets from MA local @KDNuggets http://bit.ly/z59SfY

    January 26, 2012

  • Douglas M.

    Found on Quora an interesting list of public data sources for a UCBerkeley data science course co-taught by Jeff Hammerbacher, Cloudera's Chief Scientist. http://goo.gl/V3e85

    January 20, 2012

  • John V.

    Thanks! I added SAS (and took down Ruby). Will note we were 15 votes in.

    January 13, 2012

  • Theresa D.

    BTW.... You forgot to include SAS on the second poll.

    January 13, 2012

  • Douglas M.

    I should have said, better electricity price, load, weather based demand models. I know where to get 10 years of hourly electric load and price data for New York State.

    January 13, 2012
