Implementing, optimizing and using Apache Druid at scale

Big Things

Details

AGENDA
========
18:00 - 18:30 - Mingling and food :)
18:30 - 19:10 - Funnel Analysis with Spark and Druid - Itai Yaffe (Big Data Tech Lead) @ Nielsen
19:20 - 20:00 - Scalable Incremental Index for Druid - Dr. Edward (Eddie) Bortnikov (Senior Director of Research) @ Verizon Media
20:10 - 20:50 - A day in the life of a Druid implementor and Druid’s roadmap - Benjamin Hopp (Solutions Architect) @ Imply
20:50 - 21:00 - Wrap-up and closing remarks

*********************** Note: ***********************
* All sessions will be delivered in English
* There is free 3-hour parking at TLV Fashion Mall (a 5-minute walk from the venue), and free parking at Givon parking for Discount Bank card holders
*****************************************************

Title: Funnel Analysis with Spark and Druid

Abstract:
Every day, millions of advertising campaigns are running around the world.
For campaign owners, measuring ongoing campaign effectiveness (e.g. “how many distinct users saw my online ad vs. how many distinct users saw my online ad, clicked it and purchased my product?”) is super important.
However, this task (often referred to as “funnel analysis”) is not easy, especially if the chronological order of events matters. So, while the combination of Druid and ThetaSketch aggregators can answer some of these questions, it still can’t answer a question like “how many distinct users viewed the brand’s homepage FIRST and THEN viewed the product X page?”
In this talk, we will discuss how we combine Spark, Druid and ThetaSketch aggregators to answer such questions at scale.
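
To make the distinction concrete, here is a minimal Python sketch of the two kinds of question. It is only an illustration on a tiny in-memory event log, not the approach presented in the talk: the event data, page names and function names are all made up, and exact Python sets stand in for the approximate sketches. An unordered funnel is a plain set intersection, while an ordered funnel also needs per-user timestamps.

```python
# Hypothetical event log: (user_id, timestamp, page). All values are made up.
EVENTS = [
    ("u1", 1, "homepage"), ("u1", 2, "product_x"),   # homepage, then product
    ("u2", 1, "product_x"), ("u2", 2, "homepage"),   # product, then homepage
    ("u3", 1, "homepage"),                           # homepage only
    ("u4", 3, "product_x"),                          # product only
]


def unordered_funnel(events, first, second):
    """Users who viewed both pages, in any order (a set intersection)."""
    saw_first = {user for user, _, page in events if page == first}
    saw_second = {user for user, _, page in events if page == second}
    return saw_first & saw_second


def ordered_funnel(events, first, second):
    """Users whose earliest view of `first` precedes a view of `second`."""
    earliest_first = {}
    latest_second = {}
    for user, ts, page in events:
        if page == first:
            earliest_first[user] = min(ts, earliest_first.get(user, ts))
        elif page == second:
            latest_second[user] = max(ts, latest_second.get(user, ts))
    return {
        user
        for user, ts in earliest_first.items()
        if user in latest_second and ts < latest_second[user]
    }


print(len(unordered_funnel(EVENTS, "homepage", "product_x")))  # 2
print(len(ordered_funnel(EVENTS, "homepage", "product_x")))    # 1
```

User u2 viewed both pages but in the wrong order, so the intersection counts them while the ordered funnel does not.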

Title: Scalable Incremental Index for Druid

Abstract:
Ingestion and queries of real-time data in Druid are performed by a core software component named Incremental Index (I^2). I^2’s scalability is critical both to how quickly ingested data becomes queryable and to the operational efficiency of the Druid cluster. The current I^2 implementation is based on the traditional ordered JDK key-value (KV-)map. We present an experimental I^2 implementation that is based on a novel data structure named OakMap - a scalable thread-safe off-heap KV-map for Big Data applications in Java. With OakMap, I^2 can ingest data at almost 2x speed while using 30% less RAM. The project is expected to become GA in 2020.

Title: A day in the life of a Druid implementor and Druid’s roadmap

Abstract:
Druid is an emerging standard in the data infrastructure world, designed for high-performance slice-and-dice analytics (“OLAP”-style) on large data sets. This talk is for you if you’re interested in learning more about pushing Druid’s analytical performance to the limit. Perhaps you’re already running Druid and are looking to speed up your deployment, or perhaps you aren’t familiar with Druid and are interested in learning the basics. Some of the tips in this talk are Druid-specific, but many of them will apply to any operational analytics technology stack.

The most important contributor to a fast analytical setup is getting the data model right. The talk will center around the choices you can make when preparing your data to get the best possible query performance.

We’ll look at some general best practices to model your data before ingestion such as OLAP dimensional modeling (called “roll-up” in Druid), data partitioning, and tips for choosing column types and indexes. We’ll also look at how more can be less: often, storing copies of your data partitioned, sorted, or aggregated in different ways can speed up queries by reducing the amount of computation needed.
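
As a rough illustration of the roll-up idea, the Python sketch below pre-aggregates raw rows that share the same truncated timestamp and dimension values. It is a simplified model, not Druid code: the rows, the column layout and the hourly granularity are assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical raw events: (ISO timestamp, country, ad_id, clicks).
RAW_ROWS = [
    ("2019-09-01T10:00:05", "IL", "ad_1", 1),
    ("2019-09-01T10:12:40", "IL", "ad_1", 1),
    ("2019-09-01T10:59:59", "US", "ad_1", 1),
    ("2019-09-01T11:03:00", "IL", "ad_1", 1),
]


def roll_up(rows):
    """Sum clicks per (hour, country, ad_id) combination."""
    totals = defaultdict(int)
    for ts, country, ad_id, clicks in rows:
        hour = ts[:13] + ":00:00"  # truncate the timestamp to the hour
        totals[(hour, country, ad_id)] += clicks
    return sorted((*key, clicks) for key, clicks in totals.items())


rolled = roll_up(RAW_ROWS)
for row in rolled:
    print(row)
print(len(RAW_ROWS), "raw rows ->", len(rolled), "rolled-up rows")  # 4 raw rows -> 3 rolled-up rows
```

The two IL rows in the 10:00 hour collapse into one, so queries at hourly granularity or coarser scan fewer rows, at the cost of losing the individual events.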

We’ll also look at Druid-specific optimizations that take advantage of approximations, where you can trade accuracy for performance and reduced storage. You’ll get introduced to Druid’s features for approximate counting, set operations, ranking, quantiles, and more. And we will finish with the latest and greatest Druid news, including details about the latest roadmap and releases.