- 6:00 - 6:30 - Socialize over food and beer(s), General announcements
- 6:30 - 7:00 - HIT (Hadoop Integration Testing) for Automated Certification and Deployment
- 7:00 - 7:30 - A Visual Workbench for Big Data Analytics on Hadoop
- 7:30 - 8:00 - Large Scale Data Ingest Using Apache Flume
Session I (6:30 - 7:00 PM) - HIT (Hadoop Integration Testing) for Automated Certification and Deployment
HIT, which stands for Hadoop Integration Testing, is a Yahoo! framework for assembling Hadoop components into a full stack and running integration tests to make sure that the components interoperate with each other. HIT aims to:
- build a fully automated, modular, scalable, and flexible Hadoop stack deployment and test framework
- develop integration processes and tools for development, quality engineering, operations, and customers
- grow participation and evolve into a comprehensive self-service stack deployment and test solution
HIT is designed as an open system into which any type of testing can be plugged. We will also share new developments around HIT and how it can serve as a platform for all testing and automation.
Presenters: Mukund Madhugiri, Director of Quality and Release Engineering, Cloud Engineering Group, Yahoo!; Baljit Deot, Technical Yahoo!, Cloud Engineering Group, Yahoo!
Session II (7:00 - 7:30 PM) - A Visual Workbench for Big Data Analytics on Hadoop
Two of the major barriers to effective Hadoop deployments in the enterprise are the complexity and limited applicability of MapReduce. Software developers with Hadoop and MapReduce experience are in short supply, slowing big data initiatives. Getting faster results across a broad range of analytic scenarios requires working at a higher level of abstraction, supported by new programming paradigms and tools. In this talk we present one such approach based on our experience developing a visual workbench for big data analytics on Hadoop. This approach enables data scientists and analysts to build and execute complex big data workflows for Hadoop with minimal training and without MapReduce knowledge. Libraries of pre-built operators for data preparation and analytics reduce the time and effort required to develop big data projects on Hadoop. The framework is extensible, allowing the addition of new operators as needed. Because the underlying dataflow framework is efficient, run times are shorter, allowing faster iterations of discovery and analysis.
Presenter: Jim Falgout, Chief Technologist, Pervasive Big Data & Analytics
Session III (7:30 - 8:00 PM) - Large Scale Data Ingest Using Apache Flume
Apache Flume is a highly scalable, distributed, fault-tolerant data collection framework for Apache Hadoop and Apache HBase. Flume is designed to transfer massive volumes of event data into HDFS or HBase. Flume is declarative and easy to configure, and can easily be deployed to a large number of machines using configuration management systems like Puppet or Cloudera Manager. In this talk, we will cover the basic components of Flume and how to configure and deploy it. We will also briefly talk about the metrics Flume exposes and the various ways in which these can be collected. Apache Flume is a Top Level Project (TLP) at the Apache Software Foundation, and has made several releases since entering incubation in June 2011. Flume graduated to become a TLP in July 2012. The current release of Flume is Flume 1.3.1.
Presenter: Hari Shreedharan, PMC Member and Committer, Apache Flume, Software Engineer, Cloudera
Yahoo Campus Map:
Location on Wikimapia: