- Vespa (Big Data Serving Engine) + Bullet (Real-Time Data Query Engine)
*** To attend, please register here: https://www.eventbrite.com/e/vespa-big-data-serving-engine-bullet-real-time-data-query-engine-tickets-53079502220 ***

Hi Everyone! We'd love to invite you to a meetup on December 5th hosted by Oath (Yahoo, AOL, and many other tech companies - a Verizon subsidiary) in Sunnyvale - Yahoo, Building G, 2nd Floor. Pizza, cookies, and refreshments will be served! There will be plenty of time for conversation, networking, and Q&A. Please see the special driving instructions below.

Agenda:
5:30pm - 6:30pm: Pizza, refreshments, cookies, and networking
6:30pm - 7:15pm: "Introduction to Vespa (vespa.ai) - the open source big data serving engine (Yahoo)"
7:15pm - 8:00pm: "Bullet: Open Source Real-Time Data Query Engine"
8:00pm - 8:30pm: Networking

Introduction to Vespa: Offline and stream processing of big data sets can be done with tools such as Hadoop, Spark, and Storm, but what if you need to process big data at the time a user is making a request? Vespa allows you to search, organize, and evaluate machine-learned models from e.g. TensorFlow over large, evolving data sets with latencies in the tens of milliseconds. Vespa powers recommendation, ad targeting, and search at Yahoo, where it handles billions of daily queries over billions of documents.

Speaker Bio: Jon Bratseth is a distinguished architect at Oath and the architect and one of the main contributors to Vespa, the open big data serving engine. Jon has 20 years of experience as an architect and programmer of large distributed systems. He has a master's in computer science from the Norwegian University of Science and Technology.

Bullet: Bullet (https://github.com/bullet-db) is an open-sourced, lightweight, scalable, pluggable, multi-tenant query system that lets you query any data flowing through a streaming system without having to store it. Bullet queries look forward in time: they are submitted first, operate on data flowing through the system from the point of submission, and can run forever. Bullet addresses the challenge of efficiently supporting intractable big data aggregations like Top K, counting distincts, and windowing without a storage layer, using Sketch-based algorithms.

Speaker Bios:
- Michael Natkovich, Director Software Dev Engineering, Oath: Michael serves as the architect and director for Oath's next-generation stream processing, batch processing, experimentation, and general data tools. The problems he deals with focus on increasing scale, reducing latency, improving operability, and customer satisfaction, while driving quality in data and engineering best practices.
- Nate Speidel, Software Engineer, Oath: In addition to expanding and supporting Bullet, Nate is focused on Oath's data processing pipeline - extracting, transforming, and storing big data. Nate has a master's degree in computer science from the University of California, San Diego.

Special Driving and Building Instructions: Yahoo, Building G, is located on the corner of Mathilda and Java. From 237 or 101, take Mathilda North to First Avenue. Turn right at the light, turn right into the driveway, and go to the second building. Please park in the parking lot behind Building G. The lobby of G is between the two buildings - you should see a bunch of small grassy hills between the buildings. Once inside G, please take the elevators to the 2nd floor.
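The sketch-based aggregation idea Bullet relies on can be illustrated with a toy K-Minimum-Values (KMV) distinct-count sketch: keep only the k smallest hash values ever seen, and the position of the k-th smallest tells you roughly how many distinct items there were. This is a sketch of the general technique under our own naming, not Bullet's actual implementation (which uses the DataSketches library):

```python
import hashlib
import heapq

def kmv_estimate(stream, k=256):
    """Estimate the number of distinct items in a stream while keeping
    only the k smallest normalized hash values (a KMV sketch)."""
    heap = []     # max-heap via negation: heap[0] is -(largest kept hash)
    kept = set()  # hashes currently kept, to skip duplicates cheaply
    for item in stream:
        digest = hashlib.sha1(str(item).encode()).digest()[:8]
        h = int.from_bytes(digest, "big") / 2**64  # uniform in [0, 1)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)  # drop largest kept hash
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct items seen: count is exact
    kth_smallest = -heap[0]
    # The k-th smallest of n uniform values has expectation k/(n+1),
    # so n is approximately k/kth_smallest - 1.
    return round(k / kth_smallest) - 1
```

Memory stays O(k) no matter how long the stream runs, which is what lets a forward-looking query like Bullet's aggregate forever without a storage layer; the estimate's relative error shrinks as roughly 1/sqrt(k).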
- 57th Bay Area Hadoop User Group HUG Meetup
It's been a long time since our last meetup! We are rebooting this meetup to work with the community to share more information on recent releases and how folks can start to leverage the latest innovations. Apache Hadoop has had several recent releases (3.0.x and 3.1.x) with many new enhancements, and the community continues to innovate with further releases on the way. In this meetup, Hadoop community members will share information, use cases, and the community's experience. If you are interested in a particular topic or would like to speak at a future event, please reach out to the HUG meetup leadership team.

This meetup will focus on Apache Hadoop 3.1. There will be summary talks on YARN and HDFS enhancements to start. In the next set of meetups, we will invite community members for deeper dives.

6:30 - 6:55 PM Networking / Social
7:00 - 9:00 PM Presentations

YARN in Apache Hadoop 3.x: Updates & Demos

Description: Apache Hadoop YARN is the modern distributed operating system for big data applications. It morphed the Hadoop compute layer into a common resource management platform that can host a wide variety of applications. Many organizations leverage YARN to build their applications on top of Hadoop without repeatedly worrying about resource management, isolation, multi-tenancy issues, etc. In this talk, we'll start with the current status of Apache Hadoop YARN in Apache Hadoop 3.1.x and how it is used today. We'll then cover the present and future of YARN: features that are further strengthening YARN as the first-class resource management platform for data centers running enterprise Hadoop.

Speaker(s): Wangda Tan / Vinod Kumar Vavilapalli

Storage in Apache Hadoop 3.x: HDFS's Evolution and Ozone Introduction

Description: HDFS has several strengths: horizontal scaling of I/O bandwidth over petabytes of storage, very low-latency metadata operations, and scaling to over 60K concurrent clients. Apache Hadoop 3.0 recently added Erasure Coding, multiple NameNode support, and HDFS federation improvements. We will talk about the latest HDFS enhancements in Apache Hadoop 3.0 and what the road ahead might look like. One of HDFS's limitations is scaling the number of files and blocks in the system. We describe a radical change to Hadoop's storage infrastructure with the upcoming Ozone filesystem, which allows Hadoop to scale to tens of billions of files. Ozone fundamentally separates the namespace layer from the block layer. Further, the use of the Raft protocol has allowed the storage layer to be self-consistent. We will provide a high-level overview of the Ozone architecture.

Speaker(s): Hanisha Koneru / Arpit Agarwal
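The space saving behind Erasure Coding comes from storing parity instead of full replicas. A minimal single-parity sketch of the idea in Python (our own toy illustration: HDFS actually uses Reed-Solomon codes such as RS-6-3, which tolerate multiple simultaneous failures, while this XOR code tolerates exactly one):

```python
def xor_parity(blocks):
    """Byte-wise XOR of equal-length data blocks yields one parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def recover_lost_block(surviving_blocks, parity):
    """XOR of the parity with all surviving blocks rebuilds the lost one."""
    return xor_parity(list(surviving_blocks) + [parity])

# Three 8-byte data blocks protected by one parity block: 4 blocks of
# storage for single-failure tolerance, versus 6 blocks under 2x replication.
data = [b"block-00", b"block-01", b"block-02"]
parity = xor_parity(data)
assert recover_lost_block([data[0], data[2]], parity) == data[1]
```

The same arithmetic generalizes: Reed-Solomon computes several independent parity blocks over finite-field math, so RS-6-3 stores 9 blocks for 6 blocks of data yet survives any 3 losses.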
- Vespa (open big data serving engine) Meetup
MANDATORY REGISTRATION: https://goo.gl/forms/7kK2vlaipgsSSSH42

Vespa meetup with various presentations from the Vespa team. Vespa (http://vespa.ai) is the open big data serving engine to store, search, rank, and organize big data at user serving time. Several Vespa developers from Norway are in Sunnyvale - use this opportunity to learn more about the open big data serving engine Vespa and meet the team behind it.

WHEN: Monday, December 4th, 6:00pm - 8:00pm PDT
WHERE: Oath/Yahoo Sunnyvale Campus, Building E, Classroom 9 &[masked] First Avenue, Sunnyvale, CA 94089

Agenda:
6:00 pm: Welcome & Intro
6:15 pm: Vespa tips and tricks
7:00 pm: Tensors in Vespa, intro and use cases
7:45 pm: Vespa future and roadmap
7:50 pm: Q&A

This meetup is a good arena for sharing experience, picking up good tips and inside details on Vespa, and discussing and influencing the roadmap, and it is a great opportunity for the Vespa team to meet our users. Hope to see many of you!
- 56th Bay Area Hadoop User Group (HUG) Meetup
DataWorks / Hadoop Summit Special. Summit is less than two weeks away. Register now (https://dataworkssummit.com/san-jose-2017/attend/passes/) and enter YAHOO20 for 20% off your all-access pass.

Location: San Jose Convention Center, Room LL20A

Agenda:
6:00 - 6:30 - Network and Socialize
6:30 - 7:00 - Large-Scale Machine Learning: Use Cases and Technologies
7:00 - 7:30 - Flexible and Scalable Compute Resource Management with Apache Hadoop YARN for Large Organizations
7:30 - 8:00 - YARN Scheduling - A Step Beyond

Sessions:

Session 1 (6:30 - 7:00 PM) - Large-Scale Machine Learning: Use Cases and Technologies

In recent years, Yahoo has brought the big data ecosystem and machine learning together to discover mathematical models for search ranking, online advertising, content recommendation, and mobile applications. We use distributed computing clusters with CPUs and GPUs to train these models from hundreds of petabytes of data. A collection of distributed algorithms has been developed to achieve[masked]x the scale and speed of alternative solutions. Our algorithms construct regression/classification models and semantic vectors within hours, even for billions of training examples and parameters. We have made our distributed deep learning solutions, CaffeOnSpark (https://github.com/yahoo/caffeonspark) and TensorFlowOnSpark (https://github.com/yahoo/tensorflowonspark), available as open source. In this talk, we highlight Yahoo use cases where big data and machine learning technologies are best exemplified. We explain the algorithm and system challenges of scaling ML algorithms for massive datasets, and we provide a technical overview of CaffeOnSpark and TensorFlowOnSpark to jumpstart your journey into large-scale machine learning.

Speaker: Andy Feng is a VP of Architecture at Yahoo, leading the architecture and design of big data and machine learning initiatives. He has architected large-scale systems for personalization, ad serving, NoSQL, and cloud infrastructure. Prior to Yahoo, he was a Chief Architect at Netscape/AOL and a Principal Scientist at Xerox. He received a Ph.D. in computer science from Osaka University, Japan.

Session 2 (7:00 - 7:30 PM) - Flexible and Scalable Compute Resource Management with Apache Hadoop YARN for Large Organizations

With increases in compute workloads and a growing number of users with diverse business use cases, each with varying resource availability requirements, cluster admins require an operationally flexible and scalable way to maintain high cluster utilization while ensuring fair resource allocation across business organizations. To this end, we added new improvements to Hadoop YARN which allow for: dynamically configuring cluster and queue configurations via API/CLI; finer control over queue capacities, for example specifying absolute resources instead of percentages for queue capacity; and better control of the queue hierarchy by supporting queue add/remove/rename/move without restarting the ResourceManager. This talk will first go over our motivations for improving queue management. Next, we will go through each enhancement with examples of how to use it. Finally, we will show how LinkedIn uses these enhancements on multi-thousand-node clusters not only to facilitate queue management, but also to build tools which improve compute utilization and resource usage monitoring.

Speakers: Jonathan Hung (LinkedIn), Xuan Gong (Hortonworks)

Session 3 (7:30 - 8:00 PM) - YARN Scheduling - A Step Beyond

In recent times, the YARN Capacity Scheduler has improved a lot in terms of critical features and refactoring. A quick look at some of the recent changes in the scheduler: global scheduling support; general placement support; a better preemption model to handle resource anomalies across and within queues; absolute resource configuration support; and priority support between queues and applications. In this talk, we will deep-dive into each of these new features to give a better picture of their usage and a performance comparison. We will also provide a brief overview of ongoing efforts and how they can help solve some of the core issues we face today.

Speakers: Sunil Govind (Hortonworks), Jian He (Hortonworks)
- 55th Bay Area Hadoop User Group (HUG) Meetup
Agenda:
6:00 - 6:30 - Socialize over food and beer(s)
6:30 - 7:00 - Data Sketches: A required toolkit for Big Data Analytics
7:00 - 7:30 - Exactly-once end-to-end processing with Apache Apex
7:30 - 8:00 - Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Problems while Operationalizing Big Data Apps

Sessions:

Session 1 (6:30 - 7:00 PM) - Data Sketches: A required toolkit for Big Data Analytics

In the analysis of big data there are problematic queries that don't scale, because they require huge compute resources and time to generate exact results. Examples include count distinct, quantiles, most-frequent items, joins, matrix computations, and graph analysis. If approximate results are acceptable, there is a class of sub-linear, stochastic streaming algorithms, called "sketches", that can produce results orders of magnitude faster and with mathematically proven error bounds. For interactive queries there may not be other viable alternatives, and in the case of extracting results for these problematic queries in real time, sketches are the only known solution. For any analysis system that requires such queries over big data, sketches are a required toolkit that should be tightly integrated into the system's analysis capabilities. This technology has helped Yahoo successfully reduce data processing times from days to hours, or minutes to seconds, on a number of its internal platforms. This talk covers the current state of our open source DataSketches.github.io library, which includes adaptations and example code for Pig, Hive, Spark, and Druid, and gives architectural examples of use and a case study.

Jon Malkin is a scientist at Yahoo working to extend the DataSketches library. His previous roles have involved large-scale data processing for sponsored search, display advertising, user counting, ad targeting, and cross-device user identity modeling.

Alexander Saydakov is a senior software engineer at Yahoo working on the open source DataSketches project. In his previous roles he has been involved in building large-scale back-end data processing systems and frameworks for data analytics and experimentation based on Torque, Hadoop, Pig, Hive, and Druid. Alexander's educational background is in the field of applied mathematics.

Session 2 (7:00 - 7:30 PM) - Exactly-once end-to-end processing with Apache Apex

Apache Apex (http://apex.apache.org/) is a stream processing platform that helps organizations build processing pipelines with fault tolerance and strong processing guarantees. It was built to support low processing latency, high throughput, scalability, interoperability, high availability, and security. The platform comes with the Malhar library - an extensive collection of processing operators and a wide range of input and output connectors for out-of-the-box integration with existing infrastructure. In this talk I am going to describe how connectors, together with distributed checkpointing (the mechanism Apex uses to support fault tolerance and high availability), provide exactly-once end-to-end processing guarantees.

Vlad Rozov is an Apache Apex PMC member and back-end engineer at DataTorrent, where he focuses on the buffer server, the Apex platform network layer, benchmarks, and optimizing the core components for low latency and high throughput. Prior to DataTorrent, Vlad worked on a distributed BI platform at Huawei and on a multi-dimensional database (OLAP) at Hyperion Solutions and Oracle.

Session 3 (7:30 - 8:00 PM) - Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Problems while Operationalizing Big Data Apps

Spark and SQL-on-Hadoop have made it easier than ever for enterprises to create or migrate apps to the big data stack. Thousands of apps are being generated every day in the form of ETL and modeling pipelines, business intelligence and data cubes, deep machine learning, graph analytics, and real-time data streaming. However, the task of reliably operationalizing these big data apps involves many pain points. Developers may not have the experience in distributed systems to tune apps for efficiency and performance. Diagnosing failures or unpredictable performance of apps can be a laborious process that involves multiple people. Apps may get stuck or steal resources and cause mission-critical apps to miss SLAs. This talk will introduce the audience to these problems and their common causes. We will also demonstrate how to find and fix these problems quickly, as well as prevent them from happening in the first place.

Dr. Shivnath Babu is a co-founder and CTO of Unravel and Associate Professor of Computer Science at Duke University. With more than a decade of experience researching the ease of use and manageability of data-intensive systems, he leads the Starfish project at Duke, which pioneered the automation of Hadoop application tuning, problem diagnosis, and resource management. Shivnath has more than 80 peer-reviewed publications to his credit and has received the U.S. National Science Foundation CAREER Award, the HP Labs Innovation Award, and three IBM Faculty Awards.
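As a flavor of the streaming algorithms the Data Sketches talk covers, here is the classic Misra-Gries frequent-items algorithm in Python. This is a deterministic textbook baseline rather than the DataSketches implementation itself: with a budget of k counters, any item occurring more than n/k times in a stream of length n is guaranteed to survive in the counter table.

```python
def misra_gries(stream, k):
    """Find heavy hitters using at most k-1 counters of memory.
    Any item with true frequency > n/k is guaranteed to be returned."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter, dropping zeros.
            # Each such step cancels one occurrence of k distinct items,
            # which is what bounds the undercount.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# "a" (50/88) and "b" (30/88) both exceed 88/4 = 22 occurrences,
# so with k=4 both are guaranteed to appear in the result.
stream = ["a"] * 50 + ["b"] * 30 + list("cdefghij")
heavy = misra_gries(stream, k=4)
assert "a" in heavy and "b" in heavy
```

The returned counts are lower bounds on the true frequencies (each can be low by at most n/k), which is exactly the kind of mathematically proven error bound the talk describes.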
- 54th Bay Area Hadoop User Group (HUG) Meetup
Agenda:
6:00 - 6:30 - Socialize over food and beer(s)
6:30 - 7:00 - The Pillars of Effective Data Archiving and Tiering in Hadoop
7:00 - 7:30 - Architecture of an Open Source RDBMS powered by HBase and Spark
7:30 - 8:00 - Pulsar, a highly scalable, low latency pub-sub messaging system

Sessions:

Session 1 (6:30 - 7:00 PM) - The Pillars of Effective Data Archiving and Tiering in Hadoop

This talk will cover utilizing native Hadoop storage policies and storage types to effectively archive and tier data in your existing Hadoop infrastructure. Key focus areas are: 1. Why use heterogeneous storage (tiering)? 2. Identifying key metrics for successful archiving of Hadoop data 3. Automation requirements at scale 4. Current limitations and gotchas. Successful archiving gives Hadoop users better performance, lower hardware costs, and lower software costs. This session will cover the techniques and tools available to unlock this powerful capability in native Hadoop.

Peter Kisich works with multiple large-scale Hadoop customers successfully tiering and optimizing Hadoop infrastructure. He co-founded FactorData to bring enterprise storage features and control to open Hadoop environments. Previously, Mr. Kisich served as a global subject matter expert in big data and cloud computing for IBM, including speaking at several global conferences and events.

Session 2 (7:00 - 7:30 PM) - Architecture of an Open Source RDBMS powered by HBase and Spark

Splice Machine is an open-source database that combines the benefits of modern lambda architectures with the full expressiveness of ANSI SQL. Like lambda architectures, it employs separate compute engines for different workloads - some call this an HTAP database (Hybrid Transactional and Analytical Processing). This talk describes the architecture and implementation of Splice Machine V2.0. The system is powered by a sharded key-value store for fast short reads and writes and short range scans (Apache HBase), and an in-memory, cluster data flow engine for analytics (Apache Spark). It differs from most other clustered SQL systems such as Impala, SparkSQL, and Hive because it combines analytical processing with distributed multi-version concurrency control, providing the fine-grained concurrency required to power real-time applications. This talk will highlight the Splice Machine storage representation, transaction engine, and cost-based optimizer, and present the detailed execution of operational queries on HBase and of analytical queries on Spark. We will compare and contrast how Splice Machine executes queries with other HTAP systems such as Apache Phoenix and Apache Trafodion. We will end with some roadmap items under development involving new row-based and column-based storage encodings.

Monte Zweben is a technology industry veteran. Monte's early career was spent at the NASA Ames Research Center as Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. He then founded and was Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager of the Manufacturing Business Unit. In 1998, he founded and was CEO of Blue Martini Software - the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA. Following Blue Martini, he was chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings. He currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean's Advisory Board for Carnegie Mellon's School of Computer Science.

Session 3 (7:30 - 8:00 PM) - Pulsar, a highly scalable, low latency pub-sub messaging system

Yahoo recently open-sourced Pulsar, a highly scalable, low-latency pub-sub messaging system running on commodity hardware. It provides simple pub-sub messaging semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication. Pulsar is used across various Yahoo applications for large-scale data pipelines. Learn more about the Pulsar architecture and use cases in this talk.

Joe Francis, from the Pulsar team at Yahoo
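The fine-grained concurrency described in Session 2 rests on multi-version concurrency control: writers append timestamped versions instead of overwriting in place, and each reader sees a consistent snapshot as of its own timestamp. A toy illustration in Python (our own simplification for exposition, not Splice Machine's transaction engine):

```python
class MVCCStore:
    """Toy multi-version store: each write appends a (commit_ts, value)
    version; a reader at snapshot ts sees the newest version <= ts."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), ts ascending
        self.clock = 0      # logical commit-timestamp generator

    def write(self, key, value):
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def snapshot_read(self, key, snapshot_ts):
        visible = None
        for commit_ts, value in self.versions.get(key, []):
            if commit_ts <= snapshot_ts:
                visible = value  # newest version not after the snapshot
        return visible

# Writers never block readers: a reader holding snapshot t1 still sees
# "v1" after a later writer has committed "v2".
store = MVCCStore()
t1 = store.write("row1", "v1")
t2 = store.write("row1", "v2")
assert store.snapshot_read("row1", t1) == "v1"
assert store.snapshot_read("row1", t2) == "v2"
```

Because old versions remain readable, long-running analytical scans (the Spark side) can proceed against a stable snapshot while short transactional writes (the HBase side) continue to commit.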
- 53rd Bay Area Hadoop User Group (HUG) Meetup
Agenda:
6:00 - 6:30 - Socialize over food and beer(s)
6:30 - 7:00 - Open Source Big Data Ingest with StreamSets Data Collector
7:00 - 7:30 - Better together: Fast Data with Apache Spark™ and Apache Ignite™
7:30 - 8:00 - Recent development in Apache Oozie

Sessions:

Session 1 (6:30 - 7:00 PM) - Open Source Big Data Ingest with StreamSets Data Collector

Big data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Upstream data sources can 'drift' due to infrastructure, OS, and application changes, causing ETL tools and hand-coded solutions to fail. StreamSets Data Collector (SDC) is an open source platform for building big data ingest pipelines that allows you to design, execute, and monitor robust data flows. In this session we'll look at how SDC's "intent-driven" approach keeps the data flowing, whether you're processing data 'off-cluster', in Spark, or in MapReduce. StreamSets software delivers performance management for data flows that feed the next generation of big data applications. Its mission is to bring operational excellence to the management of data in motion, so that data arrives on time and with quality, accelerating analysis and decision making. StreamSets Data Collector is in use at hundreds of companies, where it brings unprecedented visibility into, and control over, data as it moves between an expanding variety of sources and destinations.

Pat Patterson has been working with Internet technologies since 1997, building software and working with communities at Sun Microsystems, Huawei, Salesforce, and StreamSets. At Sun, Pat was the community lead for the OpenSSO open source project, while at Huawei he developed cloud storage infrastructure software. As part of the developer evangelism team at Salesforce, Pat focused on identity, integration, and the Internet of Things. Now community champion at StreamSets, Pat is responsible for the care and feeding of the StreamSets open source community.

Session 2 (7:00 - 7:30 PM) - Better together: Fast Data with Apache Spark™ and Apache Ignite™

Spark and Ignite are two of the most popular open source projects in the area of high-performance big data and fast data. But did you know that one of the best ways to boost performance for your next-generation real-time applications is to use them together? In this session, Dmitriy Setrakyan, Apache Ignite Project Management Committee Chairman and co-founder and CPO at GridGain, will explain in detail how IgniteRDD - an implementation of the native Spark RDD and DataFrame APIs - shares the state of the RDD across other Spark jobs, applications, and workers. Dmitriy will also demonstrate how IgniteRDD, with its advanced in-memory indexing capabilities, allows execution of SQL queries many times faster than native Spark RDDs or DataFrames. Don't miss this opportunity to learn from one of the experts how to use Spark and Ignite better together in your projects.

Dmitriy Setrakyan is a founder and CPO at GridGain Systems. Dmitriy has been working with distributed architectures for over 15 years and has expertise in the development of various middleware platforms, financial trading systems, CRM applications, and similar systems. Prior to GridGain, Dmitriy worked at eBay, where he was responsible for the architecture of an ad-serving system processing several billion hits a day. Currently Dmitriy also acts as PMC chair of the Apache Ignite project.

Session 3 (7:30 - 8:00 PM) - Recent development in Apache Oozie

The first part of the talk will describe the anatomy of a typical data pipeline and how Apache Oozie meets the demands of large-scale data pipelines. In particular, we will focus on recent advancements in Oozie for dependency management among pipeline stages; incremental and partial processing; combinatorial, conditional, and optional processing; priority processing; late processing; and BCP management. The second part of the talk will focus on out-of-the-box support for Spark jobs.

Purshotam Shah is a senior software engineer with the Hadoop team at Yahoo, and an Apache Oozie PMC member and committer. Satish Saley is a software engineer at Yahoo. He contributes to Apache Oozie.
- 52nd Bay Area Hadoop User Group (HUG) Meetup
Agenda:
6:00 - 6:30 - Socialize over food and beer(s)
6:30 - 7:00 - Demystifying Big Data and Apache Spark
7:00 - 7:30 - The latest of Apache Hadoop YARN and running your Docker apps on YARN
7:30 - 8:00 - CaffeOnSpark: Distributed Deep Learning on Spark Clusters

Sessions:

Session 1 (6:30 - 7:00 PM) - Demystifying Big Data and Apache Spark

This is an introductory talk for those who want to get into big data and learn about Spark, but don't know where to start. Spark is a fast, easy-to-use, general-purpose cluster computing framework for processing large datasets. It has become the most active open-source big data project. The talk will start with an introduction to big data, the challenges associated with it, and how organizations are getting value out of it. Next, Mohammed will discuss some of the important big data technologies created in the last few years. Then he will dive into Spark and talk about its role in the big data ecosystem. Specifically, he will cover the following: a) why Spark has set the big data world on fire, b) why people are replacing Hadoop MapReduce with Spark, c) what kind of applications really benefit from Spark, and d) an overview of Spark's high-level architecture. Finally, he will introduce the key libraries that come pre-packaged with Spark and discuss how these libraries simplify a variety of analytical tasks: a) interactive analytics, b) stream processing, c) graph analytics, and d) machine learning.

Speaker: Mohammed Guller is the principal architect at Glassbeam, where he leads the development of advanced and predictive analytics products. He is also the author of the recently published book "Big Data Analytics with Spark." He is a big data and Spark expert and is frequently invited to speak at big data-related conferences. He is passionate about building new products, big data analytics, and machine learning. Over the last 20 years, Mohammed has successfully led the development of several innovative technology products from concept to release. Prior to joining Glassbeam, he was the founder of TrustRecs.com, which he started after working at IBM for five years. Before IBM, he worked at a number of hi-tech start-ups, leading new product development. Mohammed has a master's in business administration from the University of California, Berkeley, and a master's in computer applications from RCC, Gujarat University, India.

Session 2 (7:00 - 7:30 PM) - The latest of Apache Hadoop YARN and running your Docker apps on YARN

Apache Hadoop YARN is a modern resource-management platform that handles resource scheduling, isolation, and multi-tenancy for a variety of data processing engines that can co-exist and share a single data center in a cost-effective manner. In the first half of the talk, we are going to take a brief look at some of the big efforts cooking in the Apache Hadoop YARN community. We will then dig deeper into one of those efforts: supporting the Docker runtime in YARN. Docker is an application container engine that enables developers and sysadmins to build, deploy, and run containerized applications. In this half, we'll discuss container runtimes in YARN, with a focus on using the DockerContainerRuntime to run various Docker applications under YARN. Support for container runtimes (including the Docker container runtime) was recently added to the Linux Container Executor (YARN-3611 and its sub-tasks). We'll walk through various aspects of running Docker containers under YARN: resource isolation, some security aspects (for example container capabilities, privileged containers, and user namespaces), and other work-in-progress features like image localization and support for different networking modes.

Speakers: Vinod Kumar Vavilapalli is the Hadoop YARN and MapReduce guy at Hortonworks. He is a long-term Hadoop contributor at Apache, a Hadoop committer, and a member of the Apache Hadoop PMC. He has a bachelor's degree in Computer Science and Engineering from the Indian Institute of Technology Roorkee. He has been working on Hadoop for nearly 9 years and he still has fun doing it. Straight out of college, he joined the Hadoop team at Yahoo! Bangalore, before Hortonworks happened. He is passionate about using computers to change the world for the better, bit by bit.

Sidharta Seethana is a software engineer at Hortonworks. He works on the YARN team, focusing on bringing new kinds of workloads to YARN. Prior to joining Hortonworks, Sidharta spent 10 years at Yahoo! Inc., working on a variety of large-scale distributed systems for core platforms/web services, search and marketplace properties, the developer network, and personalization.

Session 3 (7:30 - 8:00 PM) - CaffeOnSpark: Distributed Deep Learning on Spark Clusters

Deep learning is a critical capability for gaining intelligence from datasets. Many existing frameworks require a separate cluster for deep learning, and multiple programs have to be created for a typical machine learning pipeline. Separate clusters require large datasets to be transferred between clusters, and introduce unwanted system complexity and latency for end-to-end learning. Yahoo introduced CaffeOnSpark (https://github.com/yahoo/CaffeOnSpark) to alleviate those pain points and bring deep learning onto Hadoop and Spark clusters. By combining salient features from the deep learning framework Caffe (https://github.com/BVLC/caffe) and the big data framework Apache Spark, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. The framework is complementary to the non-deep-learning libraries MLlib and Spark SQL, and its DataFrame-style API provides Spark applications with an easy mechanism to invoke deep learning over distributed datasets. Its server-to-server direct communication (Ethernet or InfiniBand) achieves faster learning and eliminates a scalability bottleneck. We have released CaffeOnSpark at github.com/yahoo/CaffeOnSpark under the Apache 2.0 license. In this talk, we will provide a technical overview of CaffeOnSpark, its API, and deployment on a private or public cloud (AWS EC2). A demo with an IPython notebook will also be given to demonstrate how CaffeOnSpark works with other Spark packages (e.g. MLlib).

Speakers: Andy Feng is a VP of Architecture at Yahoo, leading the architecture and design of big data and machine learning initiatives. He has architected major platforms for personalization, ad serving, NoSQL, and cloud infrastructure. Jun Shi is a Principal Engineer at Yahoo who specializes in machine learning platforms and large-scale machine learning algorithms. Prior to Yahoo, he was designing wireless communication chips at Broadcom, Qualcomm, and Intel. Mridul Jain is a Senior Principal at Yahoo, focusing on machine learning and big data platforms (especially realtime processing). He has worked on trending algorithms for search, unstructured content extraction, realtime processing for a central monitoring platform, and is the co-author of Pig on Storm.
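Enabling the Docker runtime discussed in Session 2 is, at its core, a NodeManager configuration change plus per-application environment variables. The fragment below is a hedged sketch based on the Apache Hadoop 3.x documentation; property names and supported values vary across Hadoop versions, so treat it as illustrative and check the docs for your release:

```xml
<!-- yarn-site.xml on each NodeManager: use the LinuxContainerExecutor
     and allow the Docker runtime alongside the default runtime -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
  <value>default,docker</value>
</property>
```

An application then opts in per container, e.g. by setting `YARN_CONTAINER_RUNTIME_TYPE=docker` and `YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=<image>` in the container's launch environment, while security-sensitive knobs (privileged containers, allowed networks) stay under cluster-admin control in `yarn-site.xml` and `container-executor.cfg`.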
- 51st Bay Area Hadoop User Group (HUG) Monthly Meetup