54th Bay Area Hadoop User Group (HUG) Meetup

Hosted By
Yahoo! HUG O.

Details

Agenda:

6:00 - 6:30 - Socialize over food and beer(s)

6:30 - 7:00 - The Pillars of Effective Data Archiving and Tiering in Hadoop

7:00 - 7:30 - Architecture of an Open Source RDBMS powered by HBase and Spark

7:30 - 8:00 - Pulsar, a highly scalable, low latency pub-sub messaging system

Sessions:

Session 1 (6:30 - 7:00 PM) - The Pillars of Effective Data Archiving and Tiering in Hadoop

This talk will cover utilizing native Hadoop storage policies and types to effectively archive and tier data in your existing Hadoop infrastructure. Key focus areas are:

  1. Why use heterogeneous storage (tiering)?

  2. Identifying key metrics for successful archiving of Hadoop data

  3. Automation requirements at scale

  4. Current limitations and gotchas

Successful archiving gives Hadoop users better performance, lower hardware costs, and lower software costs. This session will cover the techniques and tools available to unlock this capability in native Hadoop.

Peter Kisich works with multiple large-scale Hadoop customers, successfully tiering and optimizing their Hadoop infrastructure. He co-founded FactorData to bring enterprise storage features and control to open Hadoop environments. Previously, Mr. Kisich served as a global subject matter expert in Big Data and Cloud computing for IBM, where he spoke at several global conferences and events.

Session 2 (7:00 - 7:30 PM) - Architecture of an Open Source RDBMS powered by HBase and Spark

Splice Machine is an open-source database that combines the benefits of modern lambda architectures with the full expressiveness of ANSI SQL. Like lambda architectures, it employs separate compute engines for different workloads; some call this an HTAP (Hybrid Transactional/Analytical Processing) database. This talk describes the architecture and implementation of Splice Machine V2.0. The system is powered by a sharded key-value store (Apache HBase) for fast short reads, writes, and short range scans, and by an in-memory cluster dataflow engine (Apache Spark) for analytics. It differs from most other clustered SQL systems, such as Impala, SparkSQL, and Hive, because it combines analytical processing with distributed Multi-Version Concurrency Control (MVCC), which provides the fine-grained concurrency required to power real-time applications. This talk will highlight the Splice Machine storage representation, transaction engine, and cost-based optimizer, and present the detailed execution of operational queries on HBase and of analytical queries on Spark. We will compare how Splice Machine executes queries with how other HTAP systems, such as Apache Phoenix and Apache Trafodion, execute them. We will end with some roadmap items under development involving new row-based and column-based storage encodings.

Monte Zweben is a technology industry veteran. Monte’s early career was spent with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. He then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager of the Manufacturing Business Unit. In 1998, he founded and became CEO of Blue Martini Software, the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA. Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings. He currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie Mellon’s School of Computer Science.

Session 3 (7:30 - 8:00 PM) - Pulsar, a highly scalable, low latency pub-sub messaging system

Yahoo recently open-sourced Pulsar, a highly scalable, low-latency pub-sub messaging system running on commodity hardware. It provides simple pub-sub messaging semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication. Pulsar is used across various Yahoo applications for large-scale data pipelines. Learn more about Pulsar's architecture and use cases in this talk.

Joe Francis is from the Pulsar team at Yahoo.
