The Latest in Apache Hive, Spark, Druid and Impala


The Big Data world has many tools and technologies, each suited to different contexts and workloads. Hive, Spark, Druid and Impala are among the best known. Come and learn from experts about what is new in each of them. We will start with short, high-level talks covering the latest in each project, followed by a couple of talks that dive deep into specific features.

Food and beverages will be provided!

6:00 – 6:45: Networking & Food
6:45 – 8:00: Presentation(s)
8:00 – 8:30: Final Questions

Talk #1:
Title: Materialized views and recommendation engine for Apache Hive
Speaker: Jesus Camacho Rodriguez
Abstract: Materialized views were introduced in Apache Hive 3 to accelerate query execution in data warehouses. This talk discusses support for alternative storage options for materialized views, the current coverage of our transparent query rewriting algorithm, and how Hive controls important aspects of the materialized view life cycle, such as the freshness of the data. In addition, we provide a sneak peek into our ongoing work on designing a recommendation engine for Apache Hive, a critical component to help users gain insights about their warehouse workloads and speed up their most demanding queries using materialized views.

Talk #2:
Title: Apache Druid: Current status and Roadmap
Speaker: Slim Bouguerra
Abstract: Learn how Druid is used to analyze billions of daily actions on hundreds of different device types across the world. We will discuss the current project status and the future roadmap to empower data-driven decision making without overwhelming cluster owners.

Talk #3:
Title: Synchronized metastore cache in Apache Hive
Speaker: Daniel Dai
Abstract: A metastore cache already exists in Hive. However, the cache is not consistent in a metastore HA setting; it uses an eventually consistent model. In this talk, I will illustrate an improvement to the metastore cache that makes it consistent.

Talk #4:
Title: Onto new pastures: Impala in the Cloud
Speakers: Lars Volker, Tim Armstrong
Abstract: We will follow a query through Impala’s planner, scheduler, code generation, and distributed execution, explaining the architecture and the role each component plays along the way. Building on that, we will describe ongoing engineering work on improving Impala performance in the cloud. Running efficiently and robustly in the cloud presents several challenges. In particular, running Impala against cloud object stores like S3 and ADLS is fundamentally different from running Impala colocated with HDFS DataNodes, because all data must be read across the network.

This Meetup is not only for a technical audience; sales and field professionals are also encouraged to attend. We hope you can make it!