Past Meetup

A Spark Auto Scaling Kudu Sneak Peek in 3s

175 people went


Hi, everyone. I’m excited to announce an incredible line-up of three awesome presenters and talks for the Nov 5th event hosted by Capital One. The agenda is below. See you there!


6:00 – 6:30 Networking and food

6:30 – 6:40 Welcome & Introductions

6:45 – 7:15 Capital One: Spark and Auto Scaling

7:20 – 7:50 Cloudera: Kudu: a new storage engine for fast analytics on fast data

7:55 – 8:25 Databricks: A Sneak Peek into Apache Spark 1.6

8:30 – 8:45 Close & Wrap-up

CAPITAL ONE: Spark and Auto Scaling

For the upcoming Spark meetup on Nov 5th, Capital One will showcase its adoption of Spark with a deep dive into its Spark cluster setup on AWS using CloudFormation and Chef. The session will walk through the scripts that launch a 4-node cluster on AWS, then cover the cluster's Auto Scaling capability and the automatic creation and deletion of the stack. This will be an interactive session that includes a demo of these capabilities and an opportunity to share learnings and feedback and to ask questions!

Presenter Bio: Saurabh Gupte

Saurabh Gupte is a Software Engineer at Capital One and a Spark Certified Developer. He is currently working on the team that is standing up a Spark platform for processing data at rest. Saurabh has over 10 years of experience in architecting, designing, and developing ETL and data movement applications.

CLOUDERA: Kudu: a new storage engine for fast analytics on fast data

Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap with traditional database technologies. With systems such as Impala and Apache Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily sized datasets.

Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but offer little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but their scan rates are too slow for large-scale data warehousing workloads.
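To make the trade-off above concrete, here is a minimal sketch in plain Python. It is a toy illustration only: the names (`rows_by_key`, `columns`, and the helper functions) are hypothetical and do not reflect how Parquet, HBase, or Kudu are actually implemented. It simply contrasts a keyed row layout, which favors random access, with a column layout, which favors scans.

```python
# Toy illustration only: two in-memory layouts for the same three records.
# All names here are hypothetical; none come from Parquet, HBase, or Kudu.

records = [
    {"id": 1, "user": "ann", "amount": 10.0},
    {"id": 2, "user": "bob", "amount": 25.5},
    {"id": 3, "user": "cat", "amount": 4.5},
]

# Row-oriented, keyed layout: one lookup returns a whole record.
rows_by_key = {r["id"]: r for r in records}

# Column-oriented layout: each column is stored as its own list.
columns = {name: [r[name] for r in records] for name in ("id", "user", "amount")}


def lookup_row(key):
    """Random access: a single keyed lookup in the row layout."""
    return rows_by_key[key]


def total_amount_columnar():
    """Analytic scan: reads only the 'amount' column."""
    return sum(columns["amount"])


def total_amount_row_oriented():
    """Same scan over rows: visits every full record to use one field."""
    return sum(r["amount"] for r in rows_by_key.values())


def update_amount_columnar(key, new_amount):
    """Update in the column layout: must first search for the row position."""
    position = columns["id"].index(key)  # linear search, no index available
    columns["amount"][position] = new_amount
```

In this sketch, `lookup_row` is cheap in the row layout, while `total_amount_columnar` touches far less data than its row-oriented counterpart; the unindexed search in `update_amount_columnar` hints at why in-place modification is awkward for purely columnar formats.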

This talk introduces Kudu, a new open source storage engine that fills the gap described above. It will also highlight Kudu's integration with Apache Spark and demo an end-to-end data ingestion pipeline that uses Kafka, Spark Streaming, and Kudu, which will then be queried using Impala.

Presenter Bio: Jean-Daniel Cryans

Jean-Daniel Cryans is a Software Engineer at Cloudera currently working on the Kudu team, and an Apache HBase PMC member. Before joining Cloudera, he was at StumbleUpon, where he worked on HBase while maintaining its production deployment.

DATABRICKS: A Sneak Peek into Apache Spark 1.6

Pat McDonough, Director of Client Solutions at Databricks, will provide a sneak peek into Apache Spark 1.6, from RDDs to DataFrames to Datasets. Spark 1.6 will include (but is not limited to) adaptive query execution [SPARK-9850]; a type-safe API called Dataset on top of DataFrames that leverages all the work in Project Tungsten for more robust and efficient execution, including memory management, code generation, and query optimization [SPARK-9999]; and unified memory management that consolidates cache and execution memory [SPARK-10000]. Come to this Washington DC Area Spark Interactive session to learn more and ask questions about Spark 1.6!

Presenter Bio: Pat McDonough

Pat McDonough is one of Databricks' first employees, focused on helping customers build their Spark-based applications and data pipelines. For the past two years, he has played an important role in helping the Spark project grow into the most active project in Big Data, working hand-in-hand with the impressive engineering team that continues to push exciting new technology into the project, and finding great ways to use this technology in the real world. Before joining Databricks, Pat held positions at Cloudera, Red Hat, and AT&T.


Capital One (


Note: Park in the parking garage next to the 8020 Towers Crescent Drive building, 3rd floor, and bring your parking ticket to the event, where it will be validated.