SF / East Bay Area Stream Processing Meetup


Details
-------------------------------------------------------------------
6:45PM to 7:00PM - Socializing
7:00PM to 7:45PM - Introduction to Spark and Tachyon
7:45PM to 8:00PM - Taming the Chaos of Stream Processing
8:00PM to 8:30PM - Socializing
-------------------------------------------------------------------
Introduction to Spark and Tachyon
Bill Zhao (from TubeMogul)
Bill worked as a researcher in the UC Berkeley AMPLab during the creation of Spark and Tachyon, where he focused on improving Spark memory utilization and Spark-Tachyon integration. Working at the intersection of three massive trends, powerful machine learning, cloud computing, and crowdsourcing, the AMPLab integrates Algorithms, Machines, and People to make sense of Big Data.
Bin Fan (from Tachyon Nexus)
Bin Fan is a software engineer at Tachyon Nexus and one of the top committers in the open source Tachyon project. Prior to Tachyon Nexus, he worked at Google building core storage infrastructure, where he won Google's Technical Infrastructure award. Bin received his PhD in computer science from Carnegie Mellon University.
Description:
Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or in Spark's standalone mode, and it can process data in HDFS and other Hadoop-compatible storage. It is designed to perform both batch processing (similar to MapReduce) and newer workloads such as streaming, interactive queries, and machine learning. Tachyon is a memory-centric distributed storage system that enables reliable data sharing at memory speed across cluster frameworks such as Spark and MapReduce. It achieves high performance by leveraging lineage information and using memory aggressively: Tachyon caches working-set files in memory, avoiding trips to disk for datasets that are frequently read, so different jobs, queries, and frameworks can access cached files at memory speed.
Memory is the key to fast Big Data processing. Many have realized this, and frameworks such as Spark and Shark already leverage memory for performance. As data sets continue to grow, storage is increasingly becoming a critical bottleneck in many workloads.
To address this need, we developed Tachyon, a memory-centric, fault-tolerant distributed file system that enables reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. The result of over two years of research, Tachyon achieves memory speed and fault tolerance by using memory aggressively and leveraging lineage information. Tachyon caches working-set files in memory and lets different jobs, queries, and frameworks access them at memory speed, avoiding trips to disk for frequently read datasets.

Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code changes. Tachyon is also the default off-heap option in Spark, which means RDDs can automatically be stored inside Tachyon to make Spark more resilient and avoid GC overheads.

The project is open source and is already deployed at multiple companies. Tachyon has more than 100 contributors from over 30 institutions, including IBM, Yahoo, Intel, Red Hat, Baidu, and Tachyon Nexus. It is the storage layer of the Berkeley Data Analytics Stack (BDAS) and is also part of the Fedora distribution.
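As a rough sketch of what "no code changes" means in practice: in our understanding, an existing job can be pointed at Tachyon by swapping the filesystem URI scheme, and Spark 1.x exposed a configuration property for its Tachyon-backed off-heap store. All hostnames, ports, and paths below are illustrative, not from this announcement:

```
# Illustrative configuration sketch (hostnames and paths are hypothetical).

# 1. Point an existing Hadoop/Spark job at Tachyon by changing only the
#    filesystem URI it reads, with no application code changes:
#      before: hdfs://namenode:9000/data/events.log
#      after:  tachyon://tachyon-master:19998/data/events.log

# 2. spark-defaults.conf: tell Spark where its Tachyon-backed off-heap
#    store lives (Spark 1.x-era property):
spark.tachyonStore.url    tachyon://tachyon-master:19998
```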
Taming the Chaos of Stream Processing
Kulgavin
A technologist at heart, Mr. Kulgavin has spent the past two decades in leadership roles across software development, sales, and product management. Starting out in the embedded systems world in the late 90s, Mr. Kulgavin pioneered a visual platform that simplified the creation of control automation systems for the smart building, transportation, and oil distribution markets. More recently, Mr. Kulgavin applied lessons learned from real-time embedded systems to the world of big data: he led the team that created the MINTDATA real-time stream processing platform. With a stream processor engine built from scratch (yet backward compatible with Apache Storm) and a mechanism to visually define data pipelines, the MINTDATA platform today runs in production and helps teams define and manage massive streams of data at scale more efficiently.
Description:
Stream processing is broken: we spend inordinate amounts of time building, maintaining, and deploying the software and infrastructure behind stream processing pipelines. As an industry, it's time to stop repeating ourselves and instead focus on gleaning domain-specific insights from raw data. At MINTDATA, we have taken one approach to spending less time on infrastructure and more on the data domains at hand. In this brief 10-minute talk, we'll show an example of how we help companies and people manage stream processing at scale more efficiently.
