Detailed agenda and summaries:
- 6:00 - 6:30 - Socialize over food and beer(s)
- 6:30 - 7:00 - Data driven local commerce @ Groupon
- 7:00 - 7:30 - Using Apache Hive with HBase and recent improvements
- 7:30 - 8:00 - JuteRC compiler
Data driven local commerce @ Groupon
Groupon started out three years ago as a "deal of the day" company and has rapidly grown into one of the largest e-commerce companies on the planet, connecting the worlds of online and offline commerce. In this talk, we give an overview of how Groupon employs a data-driven approach to power local commerce, using Big Data to deliver the right deal to the right consumer at the right time. We'll give a "view from the trenches" on how we've built and grown our relevance technology leveraging Hadoop and other open-source tools.
Presenters: Shawn Jeffery and Sean O'Brien, Groupon
Using Apache Hive with HBase and recent improvements
Apache Hive and HBase are very popular projects in the Hadoop ecosystem. Using Hive with HBase was made possible by contributions from Facebook around 2010. In this talk, we will go over the details of how the integration works, and talk about recent improvements. Specifically, we will cover the basic architecture, schema and data type mappings, and recent filter pushdown optimizations. We will also go into detail about the security aspects of Hadoop/HBase related to Hive setups.
Presenter: Enis Soztutar, Hortonworks
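For context, the Hive–HBase mapping described in this talk is configured declaratively: a Hive table is declared with the HBase storage handler, plus a column mapping that ties Hive columns to HBase column families. A minimal sketch (table and column names here are illustrative, not from the talk):

```sql
-- Declare a Hive table backed by an existing HBase table.
-- ":key" maps the Hive `key` column to the HBase row key;
-- "cf1:val" maps `value` to qualifier `val` in column family `cf1`.
CREATE EXTERNAL TABLE hbase_backed_table (key INT, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "my_hbase_table");
```

Queries such as `SELECT value FROM hbase_backed_table WHERE key = 42` then read through the storage handler, and with the filter pushdown improvements mentioned in the abstract, predicates like this can be pushed down into the underlying HBase scan rather than evaluated in Hive.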
JuteRC compiler
Yahoo’s data ETL pipeline continuously processes tens of terabytes of data every day. Finding a storage format that can store and fetch this data efficiently has always been a challenge for the pipeline. A recent internal study at Yahoo showed a dramatic reduction in data size when switching from the SequenceFile to the RCFile format, so we decided to convert our data to RCFile. The most challenging task is manually serializing the data objects. We rely on Jute, the Hadoop record compiler, to generate serialization code; however, Jute does not support the RCFile format, and RCFile in turn does not support native Hadoop Writable objects, so writing serialization code becomes complicated and repetitive. We therefore built the JuteRC compiler, an extension to Jute that generates serialization/deserialization code for any user-defined primitive or composite data type. MapReduce programmers can plug the generated code directly into jobs whose output is written in the RCFile storage format. In our experiments on Yahoo audience data, JuteRC yielded a 26-28% file-size reduction and a 40% read/write performance improvement compared to SequenceFile. We are currently in the process of open-sourcing JuteRC.
Presenter: Tanping Wang, Yahoo
Yahoo Campus Map:
Location on Wikimapia: