
50th Bay Area Hadoop User Group (HUG) Monthly Meetup

Hosted By: Yahoo! HUG Organizer

Details

Agenda:

6:00 - 6:30 - Socialize over food and beer(s)

6:30 - 7:00 - Apache Flink: Fast and reliable large-scale data processing

7:00 - 7:30 - Efficient user environment to facilitate working with Big Data

7:30 - 8:00 - Using HBase Co-Processors to Build a Distributed, Transactional RDBMS

Sessions:

Session 1 (6:30 - 7:00 PM) - Apache Flink: Fast and reliable large-scale data processing

Apache Flink (incubating) is one of the latest additions to the Apache family of data processing engines. In short, Flink's design aims to be as fast as in-memory engines while providing the reliability of Hadoop. Flink contains (1) APIs in Java and Scala for both batch-processing and data-streaming applications, (2) a translation stack for transforming these programs into parallel data flows, and (3) a runtime that supports both true streaming and batch processing for executing these data flows in large compute clusters.

Flink's batch APIs build on functional primitives (map, reduce, join, coGroup, etc.) and augment them with dedicated operators for iterative algorithms and with support for logical, SQL-like key attribute referencing (e.g., groupBy("WordCount.word")), as sketched below. The Flink streaming API extends the primitives of the batch API with flexible window semantics.
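
For illustration, here is a minimal batch WordCount sketch in Flink's Java DataSet API, grouping by a named POJO field rather than a tuple position. The class and field names are our own, and API details (e.g., whether print() triggers execution) have varied across Flink releases:

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.common.functions.ReduceFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class WordCountExample {

        // POJO with public fields so Flink can reference keys by field name.
        public static class WordCount {
            public String word;
            public int count;
            public WordCount() {}
            public WordCount(String word, int count) { this.word = word; this.count = count; }
        }

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            DataSet<String> text = env.fromElements("to be or not to be");

            DataSet<WordCount> counts = text
                // map-style primitive: tokenize each line into (word, 1) records
                .flatMap(new FlatMapFunction<String, WordCount>() {
                    public void flatMap(String line, Collector<WordCount> out) {
                        for (String token : line.toLowerCase().split("\\s+")) {
                            out.collect(new WordCount(token, 1));
                        }
                    }
                })
                // logical, SQL-like key reference by field name
                .groupBy("word")
                // reduce primitive: sum the counts per word
                .reduce(new ReduceFunction<WordCount>() {
                    public WordCount reduce(WordCount a, WordCount b) {
                        return new WordCount(a.word, a.count + b.count);
                    }
                });

            counts.print();
        }
    }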

Internally, Flink transforms the user programs into distributed data stream programs. In the course of the transformation, Flink analyzes functions and data types (using Scala macros and reflection), and picks physical execution strategies using a cost-based optimizer. Flink’s runtime is a true streaming engine, supporting both batching and streaming. Flink operates on a serialized data representation with memory-adaptive out-of-core algorithms for sorting and hashing. This makes Flink match the performance of in-memory engines on memory-resident datasets, while scaling robustly to larger disk-resident datasets.

Finally, Flink is compatible with the Hadoop ecosystem. Flink runs on YARN, reads data from HDFS and HBase, and supports mixing existing Hadoop Map and Reduce functions into Flink programs. Ongoing work is adding Apache Tez as an additional runtime backend.

This talk presents Flink from a user perspective. We introduce the APIs and highlight the most interesting design points behind Flink, discussing how they contribute to the goals of performance, robustness, and flexibility. Finally, we give an outlook on Flink's development roadmap.

Speakers:

Kostas Tzoumas, Co-founder and CEO, data Artisans, Committer Apache Flink

Kostas Tzoumas is a committer at Apache Flink and co-founder of data Artisans (data-artisans.com), a Berlin-based company that is developing and contributing to Apache Flink. Before founding data Artisans, Kostas was a postdoctoral researcher at TU Berlin; he received his PhD in Computer Science from Aalborg University and spent several internships at the University of Maryland, College Park, and at Microsoft Research in Redmond.

Stephan Ewen, Co-founder and CTO, data Artisans, Committer Apache Flink

Stephan Ewen is a committer at Apache Flink and co-founder of data Artisans (data-artisans.com), a Berlin-based company that is developing and contributing to Apache Flink. Before founding data Artisans, Stephan led the development of Flink from the early days of the project (then called Stratosphere) at TU Berlin. Stephan holds a PhD in Computer Science from TU Berlin and spent several internships at IBM Almaden Research and Microsoft Research.

Session 2 (7:00 - 7:30 PM) - Efficient user environment to facilitate working with Big Data

We are developing solutions that make it easier to work with data on Hadoop, regardless of the user's skill level (whether a developer, an analyst, or something else). The core engine and the user interface have been in research and development for several years, and we are only now starting to expose them to the world at large. To give you a sense of our approach, we will give a variety of demos and discuss the underlying technology, demonstrating the time savings when working with Hadoop.

Speaker:

Shevek, CompilerWorks

Shevek is an expert programmer with a strong interest in parallel and distributed systems. He has worked on cutting-edge research in compilers and language design, algorithmic optimization, systems, and security. He is capable of maintaining a very straight face under questioning on topics including "Why is our printer playing 'happy birthday'?" or "What is that message doing on the side of that building?" He received a Doctorate in Computing on the Formalization of Protection Systems from the University of Bath, England. He also holds a Master's in Pure Mathematics and an épée.

Session 3 (7:30 - 8:00 PM) - Using HBase Co-Processors to Build a Distributed, Transactional RDBMS

Monte Zweben, Co-Founder and CEO of Splice Machine, will discuss how to use HBase co-processors to build an ANSI SQL-99 database with (1) parallelization of SQL execution plans, (2) ACID transactions with snapshot isolation, and (3) consistent secondary indexing.

Transactions are critical in traditional RDBMSs because they ensure reliable updates across multiple rows and tables. Most operational applications require transactions, but even analytics systems use transactions to reliably update secondary indexes after a record insert or update.

In the Hadoop ecosystem, HBase is a key-value store with real-time updates, but it lacks multi-row, multi-table transactions, secondary indexes, and a robust query language like SQL. Combining SQL with a full transactional model over HBase opens up a whole new set of OLTP and OLAP use cases for Hadoop that were traditionally reserved for RDBMSs like MySQL or Oracle. However, a transactional HBase system has the advantage of scaling out on commodity servers, leading to a 5x-10x cost savings over traditional databases like MySQL or Oracle.

HBase co-processors, introduced in release 0.92, provide a flexible and high-performance framework for extending HBase. In this talk, we show how we used HBase co-processors to support a full ANSI SQL RDBMS without modifying the core HBase source. We will discuss how endpoint co-processors are used to ship serialized SQL execution plans to the regions, so that computation is local to where the data is stored. Additionally, we will show how observer co-processors simultaneously support both transactions and secondary indexing.
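
As a rough illustration of the observer pattern (not Splice Machine's actual code), here is a minimal region observer that mirrors one column into a secondary index table, written against the HBase 0.98-era co-processor API. The table, family, and qualifier names are invented for the example:

    import java.io.IOException;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Durability;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
    import org.apache.hadoop.hbase.util.Bytes;

    // Illustrative observer: after every Put on the base table, write an
    // inverted entry (column value -> base row key) into an index table.
    public class SecondaryIndexObserver extends BaseRegionObserver {

        private static final TableName INDEX_TABLE = TableName.valueOf("demo_index"); // hypothetical
        private static final byte[] FAMILY = Bytes.toBytes("d");
        private static final byte[] QUALIFIER = Bytes.toBytes("email");

        @Override
        public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                            Put put, WALEdit edit, Durability durability) throws IOException {
            for (Cell cell : put.get(FAMILY, QUALIFIER)) {
                // Index row key is the indexed value; the cell stores the base row key.
                Put indexPut = new Put(CellUtil.cloneValue(cell));
                indexPut.add(FAMILY, QUALIFIER, put.getRow());
                HTableInterface indexTable = ctx.getEnvironment().getTable(INDEX_TABLE);
                try {
                    indexTable.put(indexPut);
                } finally {
                    indexTable.close();
                }
            }
        }
    }

A production system would additionally have to make the index update atomic with the base write, which is where the transaction model discussed in this session comes in.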

The talk will also discuss how Splice Machine extended the work of Google's Percolator, Yahoo Labs' OMID, and the University of Waterloo on distributed snapshot isolation for transactions. Lastly, performance benchmarks will be provided, including full TPC-C and TPC-H results that show how Hadoop/HBase can be a replacement for traditional RDBMS solutions.
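
For background, the core of Percolator-style snapshot isolation can be stated as two timestamp rules. The sketch below is a simplified illustration (method and variable names are ours), not the distributed protocol itself:

    // Simplified snapshot-isolation rules used by Percolator-style designs.
    // Each transaction gets a begin timestamp; on commit it gets a commit timestamp.
    final class SnapshotRules {

        // Read rule: a committed version is visible to a transaction only if it
        // committed at or before the transaction's begin timestamp (its snapshot).
        static boolean isVisible(long versionCommitTs, long txnBeginTs) {
            return versionCommitTs <= txnBeginTs;
        }

        // Write rule (first-committer-wins): a writer must abort if some other
        // transaction committed to the same cell after the writer began.
        static boolean mustAbort(long latestCommitTsOnCell, long txnBeginTs) {
            return latestCommitTsOnCell > txnBeginTs;
        }
    }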

Speaker:

Monte Zweben, Co-Founder and CEO of Splice Machine

A technology industry veteran, Monte spent his early career with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. Monte then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit. In 1998, Monte was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA. Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings. He currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean's Advisory Board for Carnegie Mellon's School of Computer Science.

Yahoo Campus Map:

Detail map (http://photos4.meetupstatic.com/photos/event/2/8/e/d/600_21370477.jpeg)

Location on Wikimapia:

http://wikimapia.org/#lang=en&lat=37.418163&lon=-122.025061&z=18&m=b&search=yahoo
