- 6:00 - 6:30 - Socialize over food and beer(s)
- 6:30 - 7:00 - Removing the NameNode's memory limitation
- 7:00 - 7:30 - Hue: the UI for Apache Hadoop
- 7:30 - 8:00 - Compression Options in Hadoop - A Tale of Tradeoffs
Session I: Removing the NameNode's memory limitation
The current HDFS NameNode stores all of its metadata in RAM. This has allowed Hadoop clusters to scale to 100K concurrent tasks. However, the available memory limits the total number of files that a single NameNode can store. While Federation allows one to create multiple volumes with additional NameNodes, there is a need to scale a single namespace and also to store multiple namespaces in a single NameNode.
This talk describes a project that removes the space limits while maintaining similar performance by caching only the working set, or hot metadata, in NameNode memory. We believe this approach will be very effective because the subset of files that is frequently accessed is much smaller than the full set of files stored in HDFS.
In this talk we will describe our overall approach and give details of our implementation along with some early performance numbers.
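As a rough illustration of the working-set idea in this abstract, the sketch below keeps only recently used metadata entries in memory and falls back to a slower store on a miss. It is not the speaker's implementation: the class name `HotMetadataCache`, the dictionary used as a backing store, and the least-recently-used eviction policy are all assumptions made for the example.

```python
from collections import OrderedDict


class HotMetadataCache:
    """Hypothetical sketch: hold at most `capacity` metadata entries in memory."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        # Assumption: a plain dict stands in for the slower, larger metadata store.
        self.backing_store = backing_store
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, path):
        if path in self.cache:
            # Hot entry: mark it as most recently used.
            self.cache.move_to_end(path)
            self.hits += 1
            return self.cache[path]
        # Cold entry: load it from the backing store and cache it.
        self.misses += 1
        meta = self.backing_store[path]
        self.cache[path] = meta
        if len(self.cache) > self.capacity:
            # Evict the least recently used entry to stay within the memory budget.
            self.cache.popitem(last=False)
        return meta


store = {f"/data/file{i}": {"length": i} for i in range(100)}
cache = HotMetadataCache(capacity=2, backing_store=store)
for path in ["/data/file1", "/data/file2", "/data/file1", "/data/file3"]:
    cache.lookup(path)
print(cache.hits, cache.misses, list(cache.cache))
```

If the frequently accessed subset is small relative to the full namespace, most lookups in a scheme like this would be served from memory, which is the intuition the abstract relies on.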
Speaker: Lin Xiao, PhD student at Carnegie Mellon University, intern at Hortonworks
Session II: Hue: the UI for Apache Hadoop
Hue is an open source, Web-based interface that makes Apache Hadoop easier to use. Hue focuses on the Hadoop user experience and lets users concentrate on quick data processing. Hue is a mature Web project that integrates the Hadoop components and their main satellite projects into a single UI.
This talk describes how Hue’s apps like File Browser and Job Browser let you list, move and upload HDFS files or access job logs in a few clicks. Workflows can be built and scheduled to run repeatedly with drag & drop interfaces and wizards, without having to deal with any Oozie XML.
Hue comes with three editors: Hive, Pig and Impala. Each editor improves readability and productivity by providing cool features like syntax highlighting. Some other apps let you customize Solr search results, browse HBase tables or submit Sqoop jobs. Moreover, Hue comes with an SDK that lets developers reuse its libraries and start building apps on top of Hadoop.
To sum up, attendees of this talk will learn how Hue can open up their Hadoop user base and why it is the ideal client for getting familiar with or using the platform.
Speaker: Romain Rigaux, Software Engineer, Cloudera
Session III: Compression Options in Hadoop - A Tale of Tradeoffs
Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo!'s Hadoop clusters. A key component that enables this efficient operation is data compression. With regard to compression algorithms, there is an underlying tension between compression ratio and compression speed. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4 and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This talk attempts to provide guidance in that regard. Performance results with Gridmix and with several corpora of data are presented. The talk also describes enhancements we have made to the bzip2 codec that improve its performance. This will be of particular interest to the increasing number of users operating on “Big Data” who require the best possible ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.
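To get a feel for the ratio-versus-speed tension mentioned in this abstract, one can time a couple of codecs on sample data. The sketch below is only an illustration using Python's standard `zlib` (the DEFLATE algorithm behind gzip) and `bz2` modules, not Hadoop's codec classes; the `measure` helper and the synthetic log line are assumptions made for the example, and highly repetitive sample data is not representative of real workloads.

```python
import bz2
import time
import zlib


def measure(name, compress, data):
    """Hypothetical helper: time one compression call and report its ratio."""
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    return {
        "codec": name,
        "ratio": len(data) / len(compressed),
        "seconds": elapsed,
    }


# Assumption: a synthetic, repetitive log line stands in for real input data.
sample = b"INFO request served path=/index.html status=200 bytes=5120\n" * 20000

results = [
    measure("zlib level 1", lambda d: zlib.compress(d, 1), sample),
    measure("zlib level 9", lambda d: zlib.compress(d, 9), sample),
    measure("bz2 level 9", lambda d: bz2.compress(d, 9), sample),
]

for r in results:
    print(f"{r['codec']:<14} ratio={r['ratio']:.1f} time={r['seconds']:.4f}s")
```

Actual numbers depend heavily on the data and the machine, so any codec choice for a MapReduce job should be validated on a representative sample of the job's own input.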
Speaker: Govind Kamat, Member of Technical Staff, Yahoo!
Yahoo Campus Map:
Location on Wikimapia: