The folks at AppNexus will be kindly providing space, pizza, and beer for the October HBase NYC Meetup / Strata+Hadoop World HBase Meetup!
Here's the agenda:
6pm arrive + network
6:30pm Announcements and Talks
- A Hadoop and HBase Use Case: Scaling the AppNexus Data Pipeline, by AppNexus's Director of Engineering
- Fast Map Reduce over HBase, by Keith Wyss and Casey Stella from Explorys
- Continuuity New HBase Contributions, by Jonathan Gray from Continuuity
7:30pm HBase unconference ( http://en.wikipedia.org/wiki/Unconference )
9:00pm Depart and go to local beverage establishment.
Title: A Hadoop and HBase Use Case: Scaling the AppNexus Data Pipeline
Speaker: AppNexus's Director of Engineering
With 2000% growth in 2011, tens of millions of dollars of transactions taking place daily, a data pipeline bursting at the seams, and 24/7/365 uptime, AppNexus engineers faced a task likened to changing the engine of a 747 in midflight. Our data pipeline processes 12 terabytes of data every day generated by more than 20 billion ad calls, running hundreds of jobs simultaneously to generate aggregations crucial to the health of our ad platform. Data is processed on an hourly basis, aggregated with the previous hour's data and pushed upstream into reporting.
In order to horizontally scale our pipeline and data reporting, we adapted a variety of technologies, with Hadoop, HBase, and Hive composing the core elements. In this talk, we will share our lessons learned in terms of hardware application, monitoring, rollback, failover nodes, controlling memory allocation, and day-to-day fire drills. We will focus on data flow between different systems, integration details, job scheduling, common pitfalls, solutions we have developed, and the configuration and tuning of both Hadoop and HBase.
Title: Fast Map Reduce over HBase
Speakers: Keith Wyss and Casey Stella from Explorys
By far, our primary access pattern to HBase at Explorys is via Map Reduce jobs, so making them as fast as possible is of high value to us. After reading the dist-lists and hearing a few people talk about possibly reading the data raw from underneath the RegionServers, we were skeptical but intrigued. There were some online implementations (ported from Scala) that served as proofs of concept for reading out the KeyValue objects, but we wanted a drop-in replacement for TableInputFormat. This hope led to a descent into the cavern of insanity that we lovingly call HFileMergedResultInputFormat. Initial evaluation led us to believe it was faster than TableInputFormat for our use cases, but upon further inspection, turning on Scanner Caching made all of the performance improvements evaporate.
The input format was fragile and required a high price in terms of data access mechanisms that are disconnected from the typical Region Administration tools. In particular, we spin up some customized read-only region server objects on top of an active region's files. For our use case, we simplified the complexities of region administration by shutting off data feeds to a table when a job is run. We will discuss some other strategies to achieve sanity checks in a changing world, depending on differing data access needs, and demonstrate that these concerns would disappear if the much-discussed HDFS hard-links became a reality. It didn't work out, but we'll take you through what we learned while experimenting with an alternate input format that dragged us through the bowels of HBase. The whole effort is open sourced to boot.
Title: Continuuity New HBase Contributions
Speaker: Jonathan Gray from Continuuity
Continuuity is building a new product on top of HBase and Hadoop. This talk will go into detail about the different modifications and features we have implemented within and on top of HBase, including a sophisticated queueing system and transaction engine. We will share our plans for open sourcing these contributions to the HBase community, as well as other contributions we are developing and plan to contribute to the rest of the Apache ecosystem.