Apache Iceberg: Looking Below the Waterline


Details
Join us live on LinkedIn if you cannot make it in person.
In the short time since its introduction, Apache Iceberg has become the most popular, fastest-growing and most widely adopted open table format in the big data space. It addresses known pain points around data consistency, scalability, performance, and schema and partition evolution. In this meetup, you'll hear from LinkedIn, Netflix and Cloudera, partners in the open source community who are driving Iceberg enhancements and its roadmap.
This registration page is specifically for those who would like to attend this meetup in person. If you would like to participate over the web, please register using one of the links in the last paragraph below.
We have a full agenda; here's a summary of the talks we plan to deliver:
## Apache Iceberg: Foundation for Open Lakehouse
Speakers: Vincent Kulandaisamy and Shaun Ahmadian, Cloudera
This talk will cover the integration of the Iceberg open table format with the Apache Hive and Impala compute engines, support for Iceberg v1 and v2 capabilities, customer use cases, and future Iceberg enhancements and innovations in the works at Cloudera. We’ll take a detailed look at the following capabilities supported in Hive and Impala:
- Critical functional & performance enhancements
- Materialized views support
- In-place migration of Hive external tables to Iceberg tables
- Row level update/delete
- Table rollback and maintenance
Learn how Teranet keeps up with the changing growth and requirements of their business using Apache Iceberg for their change data capture use case leveraging Spark & Impala.
## Multi-function Analytics with Apache Iceberg
Speaker: Wing Yew Poon, Cloudera
This session will present a demonstration of using Spark with Iceberg tables, highlighting key Iceberg features. We’ll show the interoperability of Spark with Hive and Impala. Along the way, we’ll cover Cloudera’s contributions for improving Spark and Impala support on Iceberg.
## Apache Iceberg's REST Catalog - Real and Potential Uses Beyond Data Workflows
Speaker: Samuel Redai, Netflix
Iceberg's new REST catalog provides a friendly access point for the rich metadata and functionality that come with an Iceberg-powered data warehouse. This makes catalog operations available from pretty much any client you can imagine. However, the power of the REST catalog doesn't stop there. A myriad of tools and features that sit on the edge of the data platform benefit greatly from the REST catalog design. In this talk, Sam will cover a few creative uses that currently exist as well as some imaginative uses that could exist.
## Incremental compaction using Apache Iceberg
Speaker: Vikram Bohra, LinkedIn
At LinkedIn, streaming data in the form of Kafka topics is ingested into the data lake by low-latency ingestion pipelines powered by Apache Gobblin. This often produces small files that can contain duplicate records due to at-least-once delivery semantics, which in turn requires another set of pipelines that deduplicate data for correctness and compact it into larger files for storage and query efficiency.
Those compaction pipelines are bursty and compute-intensive, and have higher latency due to their batch processing nature. As data volume increases, it becomes increasingly important to process data incrementally for optimal resource utilization and lower latency. In this talk, we present how LinkedIn leverages Iceberg to migrate its compaction pipelines from batch to incremental processing models and solve these latency and compute problems. We also show how that leads to an improvement in overall cluster resource utilization and a more uniform workload distribution. Finally, we focus on how we optimize compaction and data deduplication in light of late-arriving data.
##
Come join us at Cloudera's office in Santa Clara. Food and drinks will be served, and there will be some cool swag for in-person attendees (limited supply, first come, first served)! We will also have dedicated time for socializing and networking shortly before and after the technical talks.
If you'd like to participate but can't make it in person, many of our "sister groups" are going to be simulcasting the proceedings online. Choose the link in your time zone:
- Pacific Time: Future of Data - Los Angeles
- Mountain Time: Future of Data - Denver
- Central Time: Future of Data - Chicago
- Eastern Time: Future of Data - New York