Past Meetup

43rd Bay Area Hadoop User Group (HUG) Monthly Meetup - An Evening on Apache Tez


350 people went


Details

Agenda

6:00 - 6:30 - Socialize over food and beer(s)
6:30 - 7:00 - Intro to Apache Tez and the Internals
7:00 - 7:30 - Pig on Tez
7:30 - 8:00 - Hive on Tez

Session I (6:30 - 7:00 PM) - Intro to Apache Tez and the Internals

Tez is an effort to develop a generic application framework that can process arbitrarily complex data-processing tasks, along with a reusable set of data-processing primitives that other projects can use. By providing a more expressive DAG of tasks for a job, Tez attempts to provide significantly enhanced data-processing capabilities for projects like Apache Pig, Apache Hive, and Cascading.
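As a rough illustration of what "a DAG of tasks for a job" means, here is a minimal Python sketch. It is not the Tez API: the task names, the run_dag helper, and the sample data are all hypothetical, and everything runs in one process. It only shows the shape of the idea: a job with two inputs, a join, and an aggregation expressed as a single graph, with each task starting once its upstream tasks have finished.

```python
from collections import Counter
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps):
    """Run every task after all of its upstream tasks have finished.

    tasks: task name -> function taking a list of upstream outputs
    deps:  task name -> list of upstream task names (empty for sources)
    Returns the per-task outputs and the order in which tasks ran.
    """
    results = {}
    order = []
    # static_order() yields each task only after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        upstream = [results[d] for d in deps.get(name, [])]
        results[name] = tasks[name](upstream)
        order.append(name)
    return results, order


# Hypothetical job: two sources feed a join, which feeds an aggregation.
DEPS = {
    "load_users": [],
    "load_clicks": [],
    "join": ["load_users", "load_clicks"],
    "count": ["join"],
}

TASKS = {
    "load_users": lambda _: {1: "ann", 2: "bob"},
    "load_clicks": lambda _: [(1, "/a"), (1, "/b"), (2, "/c")],
    # up[0] is the users mapping, up[1] is the click list (order follows DEPS).
    "join": lambda up: [(up[0][uid], page) for uid, page in up[1]],
    # Count clicks per user name from the joined rows.
    "count": lambda up: dict(Counter(name for name, _ in up[0])),
}

results, order = run_dag(TASKS, DEPS)
```

In this toy version the intermediate outputs simply stay in memory between tasks; it makes no attempt to model scheduling, containers, or data movement across a cluster.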

Speakers: Alan Gates, Co-founder and Architect, Hortonworks; Bikas Saha, Member, Technical Staff, Hortonworks

Bios:

Alan is a co-founder at Hortonworks and an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan also designed HCatalog and guided its adoption as an Apache Incubator project. Alan has a BS in Mathematics from Oregon State University and an MA in Theology from Fuller Theological Seminary. He is also the author of Programming Pig, a book from O’Reilly Press.

Bikas Saha has been working on Apache Hadoop for over a year and is a committer on the project. He has been a key contributor in making Hadoop run natively on Windows and has focused on YARN and the Hadoop compute stack.

Session II (7:00 - 7:30 PM) - Pig on Tez

With big data processing geared towards low latency, Pig on Tez aims to make ETL faster by using Tez as the execution engine instead of MapReduce on Hadoop. Tez is a distributed execution framework for executing computations as a dataflow graph, which is a more natural fit for a Pig query plan. With an optimized, shorter query plan, custom input/output/processor management, caching with container reuse, and no intermediate storage in HDFS, Pig on Tez delivers large performance improvements over Pig on MapReduce, with initial tests showing a 2-3x speedup. Pig on Tez is an Apache community-driven effort led by Hortonworks, Yahoo, Netflix and LinkedIn.

Speakers: Rohini Palaniswamy, Principal Engineer, Yahoo; Cheolsoo Park, Software Engineer, Netflix

Bios:

Rohini currently leads Pig and Oozie development at Yahoo!, and has been working on Hadoop and related projects like Pig, Oozie, HCatalog, Hive, and Grid Data Lifecycle Management for the past 5 years at Yahoo! scale. Rohini is a PMC member and committer on the Apache Pig project, and a committer on the Apache Oozie project. She is interested in large-scale data processing and is currently working on Pig on Tez, which targets low latency ETL on Hadoop.

Cheolsoo Park is an Apache Pig PMC member, committer and the current VP. He is also a senior software engineer at Netflix and works on cloud-based big data analytics infrastructure that leverages open source technologies including Hadoop, Hive and Pig. Cheolsoo holds a Bachelor’s degree in Computer Science from the University of Waterloo and is fascinated by large-scale data processing, distributed systems, and cloud computing.

Session III (7:30 - 8:00 PM) - Hive on Tez

Apache Hive is the de facto standard for SQL-in-Hadoop today, with more enterprises relying on this open source project than any alternative. Apache Tez is a general-purpose data processing framework on top of YARN. Tez provides high performance out of the box across the spectrum of low latency queries and heavy-weight batch processing. In this talk you will learn how interactive query performance is achieved by bringing the two together. We will explore how techniques like container reuse, re-localization of resources, sessions, pipelined splits, ORC stripe indexes, predicate pushdown (PPD), vectorization and more work, and how they contribute to dramatically faster start-up and query execution.

Speaker: Gunther Hagleitner, Dev Lead, Hortonworks

Bio:

Gunther Hagleitner has been contributing to various Hadoop projects for over four years, both at Yahoo! and at Hortonworks. He is an active committer in the Apache Hive project as well as a PMC member of the Apache Tez project. Before Hadoop, Gunther worked on database technology for more than a decade. At Hortonworks he is leading Hive efforts in the Stinger project, delivering performance and SQL capabilities in the ecosystem. Gunther holds an MS in Mathematics from the University of Konstanz.

Yahoo Campus Map:

Detail map (http://photos4.meetupstatic.com/photos/event/2/8/e/d/600_21370477.jpeg)

Location on Wikimapia:

http://www.wikimapia.org/#lat=[masked]&lon=[masked]&z=18&l=0&m=b&search=yahoo