- 6:00 - 6:30 - Socialize over food and beer(s)
- 6:30 - 7:00 - Azkaban: What LinkedIn Uses to Manage Hadoop Workflows
- 7:00 - 7:30 - Weave: Running YARN apps as simply as running Java threads
- 7:30 - 8:00 - Finding a Needle in a Stack of Needles: Adding Search to the Hadoop Ecosystem
Session I: Azkaban: What LinkedIn Uses to Manage Hadoop Workflows
Every day, LinkedIn updates massive datasets that power our various online features. Thousands of Hadoop jobs must execute reliably, in a specific order and on set schedules, to support these updates. For several years, LinkedIn has used Azkaban to coordinate the execution of these jobs on our production and development clusters.
Azkaban is an open-source workflow management platform that runs all of LinkedIn's Hadoop data products. Its design is reliable and scalable, and it is flexible enough to be extended with new features and to work with different Hadoop components. Azkaban focuses on ease of use, providing a modern, polished web UI as well as highly customizable job executors.
In this talk, we'll share war stories and lessons learned from supporting these workloads on Hadoop clusters with over a thousand active users, and explain how Azkaban has been redesigned over time to achieve our goals.
Speaker: Richard Park, Software Engineer, LinkedIn
Session II: Weave: Running YARN apps as simply as running Java threads
Hadoop YARN is the new, powerful, and highly flexible resource management framework that allows a cluster's resources to be used for MapReduce jobs as well as other types of applications. However, that flexibility comes with complexity, which can make getting started with YARN challenging. With Weave, we set out to make YARN more accessible to application developers who are familiar with Java but have no experience with distributed systems. Weave provides a set of libraries that makes writing distributed applications easy through an abstraction layer built over YARN, and it makes running those applications as simple as running threads. With the abstraction Weave provides, an application can be executed in process threads during development and unit testing, then deployed to a YARN cluster without any modification. Weave also has built-in support for real-time application log and metrics collection, application lifecycle management, and network service discovery, which greatly reduces the pain developers face in developing, debugging, deploying, and monitoring applications.
Speaker: Terence Yim, Software Engineer, Continuuity
Session III: Finding a Needle in a Stack of Needles: Adding Search to the Hadoop Ecosystem
Apache Hadoop is enabling organizations to collect larger, more varied data, but once it's collected, how will it be found? Your users expect to search for information using simple text-based queries, regardless of data location, size, and complexity. How do they quickly find information that was just created, or that has been stored for months or even years?
Cloudera Search team lead Patrick Hunt will present their solution to this problem. What architecture is necessary to search HDFS and HBase? How were Apache Solr, Lucene, Flume, and MapReduce integrated to allow for near-real-time and batch indexing of documents? Which problems are solved, and what's still to come? Join us for an exciting discussion of this new technology.
Speaker: Patrick Hunt, Apache ZooKeeper PMC member and Cloudera Search team lead
Yahoo Campus Map:
Location on Wikimapia: