- 6:00 - 6:30 PM - Socialize over food and beer; general announcements
- 6:30 - 7:00 PM - Session I: DistCp Redux and the Dynamic InputFormat
- 7:00 - 7:30 PM - Session II: Impala - Real-time Queries for Apache Hadoop
- 7:30 - 8:00 PM - Session III: Cloud-Friendly Hadoop and Hive
Session I (6:30 - 7:00 PM) : DistCp Redux and the Dynamic InputFormat
DistCp (distributed copy) is a popular tool for large inter- and intra-cluster copying. It uses MapReduce for distribution, error handling and recovery, and reporting. This talk will cover the rationale behind the DistCp rewrite for Hadoop 0.23, its design and new features, and a performance comparison with the legacy DistCp. It will also introduce a different approach to balancing load across mapper tasks via the DynamicInputFormat.
Speaker: Mithun Radhakrishnan, Software Engineer, Yahoo!
Session II (7:00 - 7:30 PM) : Impala - Real-time Queries for Apache Hadoop
For the first time, the Cloudera Impala project makes scalable parallel database technology, the underpinning of Google's Dremel and of commercial analytic DBMSs, available to the Hadoop community. With Impala, the Hadoop community now has an open-source codebase that lets users issue low-latency queries against data stored in HDFS and Apache HBase using familiar SQL operators. This talk will start with an overview of Impala from the user's perspective, followed by a presentation of Impala's architecture and implementation, and will conclude with a comparison of Impala with Apache Hive, commercial MapReduce alternatives, and traditional data warehouse infrastructure.
Speaker: Mark Grover, Software Engineer, Cloudera
Session III (7:30 - 8:00 PM) : Cloud-Friendly Hadoop and Hive
The cloud lowers the barrier to entry into analytics for many small and medium-sized enterprises. Hadoop and related frameworks like Hive, Oozie, and Sqoop are becoming tools of choice for deriving insights from data. However, these frameworks were designed for in-house datacenters, which have different tradeoffs from a cloud environment, and making them run well in the cloud presents some challenges. In this talk, we describe how we've extended Hadoop and Hive to exploit these new tradeoffs and offer them as part of the Qubole Data Service (QDS). We will also present use cases that show how QDS makes it easy for an end user to use these technologies in the cloud.
Speaker: Ashish Thusoo, CEO, Qubole
Yahoo Campus Map:
Location on Wikimapia: