The RHadoop Project

  • October 11, 2011 · 6:30 PM

BARUG continues its series on R and Hadoop on Tuesday Oct 11 with a presentation from Antonio Piccolboni, developer on the RHadoop Project. We'll also have a lightning talk about Quantbench, a platform for financial data analysis and exploration. As usual, networking and refreshments will start at 6:30, followed at 7:00 by a lightning talk and our main presentation. Thanks to eBay for providing a venue for this month's meeting.

Agenda

6:30 - 7:00 Networking and pizza (sponsored by Revolution Analytics)
7:00 - 7:10 Introductions and Announcements
7:20 - 7:30 Lightning Talk
  Paul Sutter, Introduction to Quantbench
7:30 - 8:30 Keynote Presentation
  Antonio Piccolboni, The RHadoop Project

rmr is a new package for running MapReduce computations in R. It is part of RHadoop, an open source project spearheaded by Revolution Analytics that connects R with the Hadoop ecosystem. In this session I will show what the package can do and cover several examples from machine learning. En route I will try to convince you that we struck the right balance between power and usability, and that you should contribute to this project.
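For readers new to the programming model, here is a minimal single-process sketch of the map, group-by-key, and reduce steps, written in Python rather than R. It is not rmr's API and does not touch Hadoop; the names `map_reduce`, `count_words_map`, and `count_words_reduce` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run a toy, in-memory map -> group-by-key -> reduce pipeline."""
    groups = defaultdict(list)
    # Map phase: each record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            # Grouping: all values that share a key are collected in one list.
            groups[key].append(value)
    # Reduce phase: each key is reduced together with its full list of values.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def count_words_map(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def count_words_reduce(word, counts):
    """Sum the per-occurrence counts collected for one word."""
    return sum(counts)


lines = ["R and Hadoop", "Hadoop and MapReduce"]
word_counts = map_reduce(lines, count_words_map, count_words_reduce)
print(word_counts)
```

Because every value for a key is gathered into a single list before the reducer runs, a key with very many records needs a correspondingly large allocation in this sketch.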

About the speakers:

Antonio Piccolboni is a data scientist with both industrial and academic experience. His recent work includes the design and implementation of a big data analysis package in R, social network analysis for a top 20 global web site and web analytics for a major web ratings company. He is currently an independent consultant with clients including Dataspora and Revolution Analytics. He blogs at blog.piccolboni.info about big data and analytics. His papers have received more than 800 citations and his Erdős number is 3.

Paul Sutter is Quantbench's Co-Founder and CEO. He most recently held the role of President and Co-Founder of Quantcast. He was instrumental in the development of Quantcast’s distributed computing architecture, which collects 12 billion records per day, processing petabytes of data on a daily basis, with over 10m customer websites. In 2010 the company was recognized as the 3rd most innovative web company by Fast Company (behind Facebook and Google). Prior to co-founding Quantcast, Paul started the WAN optimization company Orbital Data, which was acquired by Citrix in August 2006. Previously, he had also founded Transium, an internet search services company, which was acquired by AltaVista in 2000.

  • John-Mark A.

    Great introduction to a new area.

    October 13, 2011

  • Ariel E.

    Very exciting, and I very much like the way Antonio thinks about it, as well as how he presented his thoughts.

    October 12, 2011

  • Ram N.

    The Quantbench lightning talk was exactly on the mark. Antonio's presentation was great - just the right amount of code for such a vast audience. (If I had prior experience with Hadoop or MapReduce, I could have gotten even more from the talk.)

    October 12, 2011

  • Vlad S.

    Very interested. But the presentation was lacking real demos.

    October 12, 2011

  • Chuck S.

    Great talk on the really meaningful need to scale the expressiveness of R with the crunching power of Hadoop.

    October 12, 2011

  • Antonio P.

    The RSVP list is frozen at this point, so thanks for thinking about that, but we can't add any more people. Your question is very application-dependent, and I believe no general answer is possible without further information. The only general suggestion is to control the maximum number of records with the same key in the reduce phase, as a list containing all those values needs to be allocated. We may consider more iterator-like APIs in the future to deal with this problem.

    October 11, 2011

  • Rabi K.

    Hi Antonio, it looks like some people are on the waiting list, so I am OK with changing my RSVP. But I have some questions on the solution architecture side of Hadoop analytics, such as memory requirements and warnings about large memory requirements at the MapReduce + R stage. Where is the line to draw between running R in the task tracker vs. running R on Hadoop's output? Is there any performance analysis to help decide between these approaches: A) run MapReduce with large output, then run R; B) map, run rmr, and then run reduce or collective stats?

    October 11, 2011
