52nd Bay Area Hadoop User Group (HUG) Meetup

Hosted by Yahoo! HUG O.

Details

Agenda:

6:00 - 6:30 - Socialize over food and beer(s)

6:30 - 7:00 - Demystifying Big Data and Apache Spark

7:00 - 7:30 - The latest of Apache Hadoop YARN and running your docker apps on YARN

7:30 - 8:00 - CaffeOnSpark: Distributed Deep Learning on Spark Clusters

Sessions:

Session 1 (6:30 - 7:00 PM) - Demystifying Big Data and Apache Spark

This is an introductory talk for those who want to get into Big Data and learn about Spark but don't know where to start. Spark is a fast, easy-to-use, general-purpose cluster computing framework for processing large datasets. It has become the most active open-source big data project.

The talk will start with an introduction to Big Data, the challenges associated with it, and how organizations are getting value out of it. Next, Mohammed will discuss some of the important Big Data technologies created in the last few years. Then he will dive into Spark and talk about its role in the Big Data ecosystem. Specifically, he will cover the following:

a) Why Spark has set the Big Data world on fire

b) Why people are replacing Hadoop MapReduce with Spark

c) What kind of applications really benefit from Spark

d) Overview of Spark's high-level architecture

Finally, he will introduce the key libraries that come pre-packaged with Spark and discuss how these libraries simplify a variety of analytical tasks:

a) Interactive analytics

b) Stream processing

c) Graph analytics

d) Machine learning

Speaker:

Mohammed Guller is the principal architect at Glassbeam, where he leads the development of advanced and predictive analytics products. He is also the author of the recently published book, "Big Data Analytics with Spark." A Big Data and Spark expert, he is frequently invited to speak at Big Data–related conferences. He is passionate about building new products, Big Data analytics, and machine learning.

Over the last 20 years, Mohammed has successfully led the development of several innovative technology products from concept to release. Prior to joining Glassbeam, he was the founder of TrustRecs.com, which he started after working at IBM for five years. Before IBM, he worked in a number of hi-tech start-ups, leading new product development.

Mohammed has a master’s of business administration from the University of California, Berkeley, and a master’s of computer applications from RCC, Gujarat University, India.

Session 2 (7:00 - 7:30 PM) - The latest of Apache Hadoop YARN and running your docker apps on YARN

Apache Hadoop YARN is a modern resource-management platform that handles resource scheduling, isolation and multi-tenancy for a variety of data processing engines that can co-exist and share a single data-center in a cost-effective manner.

In the first half of the talk, we will take a brief look at some of the major efforts under way in the Apache Hadoop YARN community.

We will then dig deeper into one of these efforts: supporting the Docker runtime in YARN. Docker is an application container engine that enables developers and sysadmins to build, deploy, and run containerized applications. In this half, we'll discuss container runtimes in YARN, with a focus on using the DockerContainerRuntime to run various Docker applications under YARN. Support for container runtimes (including the Docker container runtime) was recently added to the Linux Container Executor (YARN-3611 and its sub-tasks). We’ll walk through various aspects of running Docker containers under YARN: resource isolation, some security aspects (for example, container capabilities, privileged containers, and user namespaces), and other work-in-progress features such as image localization and support for different networking modes.

Speakers:

Vinod Kumar Vavilapalli is the Hadoop YARN and MapReduce guy at Hortonworks. He is a long-term Hadoop contributor at Apache, a Hadoop committer, and a member of the Apache Hadoop PMC. He has a bachelor's degree in Computer Science and Engineering from the Indian Institute of Technology Roorkee. He has been working on Hadoop for nearly 9 years and still has fun doing it. Straight out of college, he joined the Hadoop team at Yahoo! Bangalore, before Hortonworks happened. He is passionate about using computers to change the world for the better, bit by bit.

Sidharta Seethana is a software engineer at Hortonworks. He works on the YARN team, focusing on bringing new kinds of workloads to YARN. Prior to joining Hortonworks, Sidharta spent 10 years at Yahoo! Inc., working on a variety of large-scale distributed systems for core platforms/web services, search and marketplace properties, developer network, and personalization.

Session 3 (7:30 - 8:00 PM) - CaffeOnSpark: Distributed Deep Learning on Spark Clusters

Deep learning is a critical capability for gaining intelligence from datasets. Many existing frameworks require a separate cluster for deep learning, and multiple programs have to be created for a typical machine learning pipeline. The separate clusters require large datasets to be transferred between them, introducing unwanted system complexity and latency for end-to-end learning.

Yahoo introduced CaffeOnSpark (https://github.com/yahoo/CaffeOnSpark) to alleviate those pain points and bring deep learning onto Hadoop and Spark clusters. By combining salient features from the deep learning framework Caffe (https://github.com/BVLC/caffe) and the big-data framework Apache Spark, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. The framework is complementary to the non-deep-learning libraries MLlib and Spark SQL, and its data-frame style API provides Spark applications with an easy mechanism to invoke deep learning over distributed datasets. Its server-to-server direct communication (Ethernet or InfiniBand) achieves faster learning and eliminates scalability bottlenecks.

We recently released CaffeOnSpark at github.com/yahoo/CaffeOnSpark under the Apache 2.0 License. In this talk, we will provide a technical overview of CaffeOnSpark, its API, and its deployment on a private cloud or public cloud (AWS EC2). An IPython notebook demo will also show how CaffeOnSpark works with other Spark packages (e.g., MLlib).

Speakers:

Andy Feng is VP Architecture at Yahoo, leading the architecture and design of big data and machine learning initiatives. He has architected major platforms for personalization, ads serving, NoSQL, and cloud infrastructure.

Jun Shi is a Principal Engineer at Yahoo who specializes in machine learning platforms and large-scale machine learning algorithms. Prior to Yahoo, he was designing wireless communication chips at Broadcom, Qualcomm and Intel.

Mridul Jain is a Senior Principal at Yahoo, focusing on machine learning and big data platforms (especially realtime processing). He has worked on trending algorithms for search, unstructured content extraction, and realtime processing for a central monitoring platform, and he is the co-author of Pig on Storm.
