Past Meetup

Kudu: Data Store for the New Era, with Kafka+Spark+Kudu Demo

113 people went


Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap with traditional database technologies. With systems such as Impala and Apache Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily sized datasets.

Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but their scan rates are too slow for large-scale data warehousing workloads.
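The trade-off above can be sketched with a toy example. This is only an illustration using plain Python containers, not how Parquet, HBase, or Kudu store data; the names `row_store`, `column_store`, and the sample fields are hypothetical.

```python
# Toy illustration only: plain Python containers stand in for storage layouts.
# All names and sample values here are hypothetical.

rows = [
    {"id": 1, "user": "a", "clicks": 10},
    {"id": 2, "user": "b", "clicks": 25},
    {"id": 3, "user": "c", "clicks": 7},
]

# Row-oriented layout: each record is kept together and indexed by key.
row_store = {r["id"]: dict(r) for r in rows}

# Column-oriented layout: each field is kept as its own array.
column_store = {
    "id": [r["id"] for r in rows],
    "user": [r["user"] for r in rows],
    "clicks": [r["clicks"] for r in rows],
}


def lookup_row_store(key):
    # One indexed lookup returns the whole record.
    return row_store[key]


def lookup_column_store(key):
    # No row index: search for the position, then gather from every column.
    pos = column_store["id"].index(key)
    return {name: values[pos] for name, values in column_store.items()}


def total_clicks_row_store():
    # An analytic scan has to visit every full record to read one field.
    return sum(r["clicks"] for r in row_store.values())


def total_clicks_column_store():
    # An analytic scan reads only the single column it needs.
    return sum(column_store["clicks"])
```

Both layouts return the same answers; they differ in how much data each operation has to touch. The row layout makes a keyed lookup cheap, while the column layout makes a single-field scan cheap.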

This talk will investigate the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. It will also describe Kudu, a new addition to the open source Hadoop ecosystem with out-of-the-box Apache Spark integration, which fills the gap described above by offering fast scans and fast random access through a single API.

Presenter: Jean-Daniel Cryans
Jean-Daniel Cryans is a Software Engineer at Cloudera currently working on the Kudu team, and an Apache HBase PMC member. Before Cloudera, he was at StumbleUpon, where he worked on HBase and maintained its production deployment.


We will start the night with a 15-minute talk by Akshat Aranya, senior software engineer at Quantcast.

Synopsis: The talk focuses on the use of Spark to power an interactive query platform. We will discuss our architecture, which uses Spark on top of HBase and Aerospike to enable fast and flexible queries over large amounts of data. We will also discuss some issues we ran into when executing very short jobs (a few seconds) with Spark.

Presenter: Akshat Aranya is a senior software engineer at Quantcast. He leads the Interactive Query team, which aims to eliminate the combinatorial explosion of batch processing systems by computing results on the fly.



6:30 - 7:00 Social (Food & Drinks will be provided)

7:00 - 7:15 Short talk by Akshat Aranya at Quantcast

7:15 - 8:15 Talk by Jean-Daniel Cryans at Cloudera, followed by Q&A