56th Bay Area Hadoop User Group (HUG) Meetup

Name: 56th Bay Area Hadoop User Group (HUG) Meetup
Start: 2017-06-12T18:00:00-07:00
End: 2017-06-12T21:00:00-07:00
Location: San Jose Convention Center

Hosted By

Yahoo! HUG O.

Details

DataWorks / Hadoop Summit Special. Summit is less than two weeks away. Register now (https://dataworkssummit.com/san-jose-2017/attend/passes/) and enter YAHOO20 for 20% off your all-access pass.

Location: San Jose Convention Center

Room: LL20A

Agenda:

6:00 - 6:30 - Network and Socialize

6:30 - 7:00 - Large-Scale Machine Learning: Use Cases and Technologies

7:00 - 7:30 - Flexible and Scalable Compute Resource Management with Apache Hadoop YARN for Large Organizations

7:30 - 8:00 - YARN Scheduling – A Step Beyond

Sessions:

Session 1 (6:30 - 7:00 PM) - Large-Scale Machine Learning: Use Cases and Technologies

In recent years, Yahoo has brought the big data ecosystem and machine learning together to discover mathematical models for search ranking, online advertising, content recommendation, and mobile applications. We use distributed computing clusters with CPUs and GPUs to train these models from hundreds of petabytes of data.

A collection of distributed algorithms has been developed to achieve 10-1000x the scale and speed of alternative solutions. Our algorithms construct regression/classification models and semantic vectors within hours, even for billions of training examples and parameters. We have made our distributed deep learning solutions, CaffeOnSpark (https://github.com/yahoo/caffeonspark) and TensorFlowOnSpark (https://github.com/yahoo/tensorflowonspark), available as open source.

In this talk, we highlight Yahoo use cases that best exemplify big data and machine learning technologies. We explain the algorithm and system challenges of scaling ML algorithms to massive datasets. We provide a technical overview of CaffeOnSpark and TensorFlowOnSpark to jumpstart your journey into large-scale machine learning.

Speaker Andy Feng is a VP of Architecture at Yahoo, leading the architecture and design of big data and machine learning initiatives. He has architected large-scale systems for personalization, ad serving, NoSQL, and cloud infrastructure. Prior to Yahoo, he was a Chief Architect at Netscape/AOL, and Principal Scientist at Xerox. He received a Ph.D. degree in computer science from Osaka University, Japan.

Session 2 (7:00 - 7:30 PM) - Flexible and Scalable Compute Resource Management with Apache Hadoop YARN for Large Organizations

With increases in compute workloads and a growing number of users with diverse business use cases, each with varying resource availability requirements, cluster admins require an operationally flexible and scalable way to maintain high cluster utilization while ensuring resource allocation fairness across business organizations. To this end, we added improvements to Hadoop YARN that allow for:

Dynamically configuring cluster and queue configurations via API/CLI,

Finer control over queue capacities, for example specifying absolute resources instead of percentages for queue capacity, and

Better control of queue hierarchy by supporting queue add/remove/rename/move without restarting ResourceManager.

This talk will first go over our motivations for improving queue management. Next, we will go through each enhancement with examples of how to use it. Finally, we will show how LinkedIn uses these enhancements on multi-thousand-node clusters, not only to facilitate queue management but also to build tools that improve compute utilization and resource usage monitoring.

Speakers: Jonathan Hung (LinkedIn), Xuan Gong (Hortonworks)

Session 3 (7:30 - 8:00 PM) - YARN Scheduling – A Step Beyond

The YARN Capacity Scheduler has recently gained several critical features and undergone significant refactoring. Here is a quick look at some of the recent changes in the scheduler:

Global Scheduling Support

General placement support

Better preemption model to handle resource anomalies across and within queues

Absolute resource configuration support

Priority support between Queues and Applications

In this talk, we will dive deep into each of these new features to give a better picture of their usage and how their performance compares. We will also give a brief overview of ongoing efforts and how they can help solve some of the core issues we face today.

Speakers: Sunil Govind (Hortonworks), Jian He (Hortonworks)
