49th Bay Area Hadoop User Group (HUG) Monthly Meetup


Details
Agenda:
6:00 - 6:30 - Socialize over food and beer(s)
6:30 - 7:00 - Apache Pig 0.14
7:00 - 7:30 - Apache Tez: A Performance View into Large Scale Data-processing, Shuffle Throughput, Reducer parallelism and Reducer Skew
7:30 - 8:00 - Lessons from Hadoop 2+Java8 migration at LinkedIn
Session I (6:30 - 7:00 PM) – Apache Pig 0.14
Pig 0.14 is being released in the second week of November, and the talk will cover its exciting new features and the major performance gains that come from a new execution engine and better query plans. Major features in Pig 0.14 include (1) Pig on Tez, which allows Tez as an alternative execution engine to MapReduce and gives a huge performance boost with lower resource consumption, (2) support for natively reading ORC files through OrcStorage, (3) support for predicate pushdown in loaders, which makes filtering very fast for loaders like OrcStorage, (4) new logical optimizer rules for predicate pushdown and constant calculation, and (5) usability improvements in shipping jars, including APIs to automatically ship jars from user code. Pig 0.14 is a major step toward supporting alternate execution engines in Pig, with Pig on Tez expected to gain major traction going forward.
Speakers:
Rohini Palaniswamy, Principal Engineer, Yahoo and Apache Pig, Oozie, and Tez PMC
Rohini currently leads Pig and Oozie development at Yahoo!, and has been working on Hadoop and related projects such as Pig, Oozie, HCatalog, Hive, and Grid Data Lifecycle Management for the past 5 years at Yahoo! scale. Rohini is a PMC member/committer on the Apache Pig, Oozie, and Tez projects. She is interested in large-scale data processing and is currently working on Pig-on-Tez, which targets low-latency ETL on Hadoop.
Daniel Dai, Member of Technical Staff, Hortonworks and Apache Pig PMC, Apache Hive Committer
Daniel is an Apache Pig PMC member/committer and Apache Hive committer, involved with Pig for 5 years at Yahoo and now at Hortonworks. He has a PhD in Computer Science with a specialization in computer security, data mining, and distributed computing from the University of Central Florida. He is interested in data science, large-scale processing, Hadoop, Pig, Hive, and more.
Session II (7:00 - 7:30 PM) – Apache Tez: A Performance View into Large Scale Data-processing, Shuffle Throughput, Reducer parallelism and Reducer Skew
Apache Tez is an extensible framework for building YARN-based, high-performance batch and interactive data processing applications. The end goal of this talk is to provide a window into the machinations of Tez, not as developers but as end users of Hive, Pig, Cascading, Scalding, etc., and to cut through a few abstractions that are common to all these processing tools. For this purpose, the Tez UI provides direct access to basic information about the runtime and job specifics.
For the practically minded, and going beyond the basic UI, we will pick some common problems faced by large-scale data-processing systems and present solutions with a Tez flavour: bad machines, shuffle throughput, reducer parallelism, and reducer skew.
Speaker:
Gopal Vijayaraghavan, Performance Lead, Hortonworks
Gopal Vijayaraghavan is a late entry into the Hadoop game, having started working on it in 2012. He works on Apache Hive and Apache Tez as part of the Stinger initiative, fixing query performance at scale.
Session III (7:30 - 8:00 PM) – Lessons from Hadoop 2+Java8 migration at LinkedIn
Hadoop has been a critical part of LinkedIn’s massive data infrastructure by providing reliable data storage and efficient processing frameworks. To accommodate increasing amounts of data and processing overhead, LinkedIn has been an early adopter of the Hadoop ecosystem. Since last year, the Hadoop team at LinkedIn has evaluated the Hadoop 2/YARN framework and is currently migrating existing clusters to YARN. LinkedIn is also leading the effort to run Hadoop on the latest JDK 8. Additionally, new Hadoop 2 compliant versions of Pig, Hive, and Azkaban have been deployed. During LinkedIn’s migration, the Hadoop team worked closely with the open source community by reporting issues using Apache Jira and submitting patches to upstream projects. In this presentation, I will cover the challenges and discoveries made while migrating thousands of jobs from Hadoop 1 to Hadoop 2 at LinkedIn.
Speakers:
Mohammad Kamrul Islam, VP, Apache Oozie at ASF, Staff Software Engineer at LinkedIn, Committer of Apache Tez
Mohammad Islam is currently working at LinkedIn on the Hadoop development team as a Staff Software Engineer. Previously, he worked at Yahoo for nearly five years as an Oozie architect/technical lead. He has been intimately involved with the Apache Hadoop ecosystem since 2008. Mohammad has a Ph.D. in Computer Science with a specialization in parallel job scheduling from Ohio State University. He is the PMC chair of Apache Oozie and a PMC member of Apache Tez.
Adam Faris, Apache Hadoop and Hive contributor, Staff Engineer at LinkedIn
Adam Faris is part of the "Grid Operations & Systems" team at LinkedIn. His team is responsible for configuring, deploying, and maintaining the Hadoop infrastructure at LinkedIn. During his 3 years at LinkedIn, Adam has participated in implementing Hadoop's Kerberos authentication layer with Hadoop 1 as well as deploying Hadoop 2.
Yahoo Campus Map:
Detail map (http://photos4.meetupstatic.com/photos/event/2/8/e/d/600_21370477.jpeg)
Location on Wikimapia:
http://www.wikimapia.org/#lat=37.4181633&lon=-122.0250607&z=18&l=0&m=b&search=yahoo
