Big Data Application Meetup 07/27


Details
Shout out to Cask (http://cask.co/) for kindly hosting and sponsoring this meetup!
Cask will also be giving away a BB-8 App-Enabled Droid (http://www.sphero.com/starwars). Enter the raffle on the day of the event for a chance to win.
AGENDA
6:00 - 6:30 - Socialize over food and beer(s)
6:30 - 8:00 - Talks
TALKS
Talk #1: Building Large Scale Applications on Apache Hadoop YARN with Apache Twill, by Poorna Chandra, Cask
Talk #2: Introduction to large-scale Machine Learning with Apache Flink, by Theodore Vasiloudis, SICS
Talk #3: Ambry: LinkedIn's Scalable Geo-distributed Object Store, by Sivabalan Narayanan, LinkedIn
ABSTRACTS
Talk #1: Building Large Scale Applications on Apache Hadoop YARN with Apache Twill, by Poorna Chandra, Cask
Twill is an Apache incubator project that provides a higher-level abstraction for building distributed applications on YARN. Developing distributed applications directly on YARN is challenging because YARN does not provide higher-level APIs, and a lot of boilerplate code has to be duplicated to deploy each application. As a result, YARN applications are typically written by framework developers, such as those working on Apache Flink or Apache Spark, who need to deploy the framework in a distributed way.
With Twill, application developers need only be familiar with the basics of the Java programming model to use the Twill APIs, so they can focus on solving business problems. In this talk I present how Twill can be leveraged, using the Cask Data Application Platform (CDAP) as an example of a system that relies heavily on Twill for resource management.
Talk #2: Introduction to large-scale Machine Learning with Apache Flink, by Theodore Vasiloudis, SICS
Apache Flink is an open source platform for distributed stream and batch data processing. In this talk we will show how Flink's streaming engine and support for native iterations make it an excellent candidate for developing large-scale machine learning algorithms.
This talk will focus on FlinkML, a new effort to bring scalable machine learning tools to the Flink community. We will provide an introduction to the library, illustrate how we employ some state-of-the-art algorithms to make FlinkML truly scalable, and provide a view into the challenges and decisions one has to make when designing a robust and scalable machine learning library.
Finally, if time permits, we will demonstrate how one can perform some interactive analysis using FlinkML and the notebook environment of Apache Zeppelin.
Talk #3: Ambry: LinkedIn's Scalable Geo-distributed Object Store, by Sivabalan Narayanan, LinkedIn
Ambry is an open-source, geo-distributed, highly available, and horizontally scalable object store built at LinkedIn. It is an active-active, immutable, eventually consistent handle store that can be configured to provide different levels of consistency. At LinkedIn, Ambry runs on hundreds of nodes spanning multiple data centers and is the source of truth for media and other immutable content.
The talk starts by discussing the need for a scalable, geo-distributed, and highly available object store in a media-centric world, and how Ambry acts as the single source of truth for all immutable content at LinkedIn. We will go over some of the design decisions that helped Ambry scale for both large and small objects, and how they helped resolve the main pain points. The talk also covers the use cases for which one could use Ambry. The second part of the talk goes over the architecture of Ambry, and the talk ends with our road map.
SPEAKER BIOS
• Poorna Chandra is a software engineer at Cask, where he is responsible for building software fueling the next generation of data applications. He is also a PMC member for Apache Twill and a PPMC member for Apache Tephra. Prior to Cask, he developed big data infrastructure at Greenplum and Yahoo!
• Theodore Vasiloudis is a Machine Learning researcher, currently performing an internship at Pandora Media. He lives and works in Stockholm at the Swedish Institute of Computer Science (SICS) and is a PhD Candidate at KTH Royal Institute of Technology. His main research interests include large-scale machine learning, graph processing, and natural language processing. He is also a contributor to FlinkML, the machine learning library for Apache Flink.
• Sivabalan Narayanan is a distributed systems enthusiast who has been with LinkedIn for the past 2 years. During his tenure he has worked in the Distributed Data Systems group, specifically on Ambry, and he was one of the early engineers of this geo-distributed object store. Anything related to distributed storage and processing excites him.
ARRIVAL AND PARKING
Cask HQ is a few minutes' walk from the California Avenue Caltrain Station.
Cask HQ also has its own parking lot, but it will not accommodate all guests. Please use the parking lots available nearby:
http://photos2.meetupstatic.com/photos/event/5/b/2/f/600_438983343.jpeg
