What we're about

This meetup is focused on the Future of Data and the open community data projects governed by the Apache Software Foundation. It is geared towards developers, data scientists and ALL data enthusiasts who are building modern data applications. Our meetups cover all data -- data-in-motion and data-at-rest. Meetups provide an opportunity to listen, share and work hands-on with other technologists using the open source and open community Apache tools.

Upcoming events (1)

Apache Iceberg: Looking Below the Waterline

5470 Great America Pkwy

In the short time since its introduction, Apache Iceberg has become the most popular, fastest-growing and most widely adopted open table format in the big data space. It addresses known pain points around data consistency, scalability, performance, and schema and partition evolution. In this meetup, you'll hear from key partners in the open source community who are leading and driving Iceberg enhancements and the roadmap.

This registration page is specifically for those who would like to attend this meetup in person. If you would like to participate over the web, please register using one of the links in the last paragraph below.

We have a full agenda; here's a summary of the four talks we plan to deliver:

## Apache Iceberg for BI use cases

Speakers: Vincent Kulandaisamy and Shaun Ahmadian, Cloudera
This talk will cover the integration of the Iceberg open table format with the Apache Hive and Impala compute engines, support for Iceberg v1 and v2 capabilities, customer use cases, and future Iceberg enhancements and innovations in the works at Cloudera. We’ll take a detailed look at the following capabilities supported in Hive and Impala:

  • Critical functional & performance enhancements
  • Materialized views support
  • In-place table migration of Hive external tables to Iceberg tables
  • Row-level update/delete
  • Table rollback and maintenance

Learn how Teranet keeps up with the growth and changing requirements of its business by using Apache Iceberg, with Spark and Impala, for its change data capture use case.

## Multi-function Analytics with Apache Iceberg

Speaker: Wing Yew Poon, Cloudera
This session will present a demonstration of using Spark with Iceberg tables, highlighting key Iceberg features. We’ll show the interoperability of Spark with Hive and Impala. Along the way, we’ll cover Cloudera’s contributions for improving Spark and Impala support on Iceberg.

## Apache Iceberg's REST Catalog - Real and Potential Uses Beyond Data Workflows

Speaker: Samual Redai, Netflix
Iceberg's new REST catalog provides a friendly access point for the rich metadata and functionality that come with an Iceberg-powered data warehouse. This makes catalog operations available from pretty much any client you can imagine. However, the power of the REST catalog doesn't stop there. A myriad of tools and features that sit on the edge of the data platform benefit greatly from the REST catalog design. In this talk, Sam will cover a few creative uses that currently exist as well as some imaginative uses that could exist.

## Incremental compaction using Apache Iceberg

Speaker: Vikram Bohra, LinkedIn
At LinkedIn, streaming data in the form of Kafka topics is ingested into the data lake by low-latency ingestion pipelines powered by Apache Gobblin. This often produces small files that can contain duplicate records due to at-least-once delivery semantics, which in turn requires another set of pipelines that deduplicate the data for correctness and compact it into larger files for storage and query efficiency.
Those compaction pipelines are bursty and compute-intensive, and they have higher latency due to their batch processing nature. As data volume grows, it becomes increasingly important to process data incrementally for optimal resource utilization and lower latency. In this talk, we present how LinkedIn leverages Iceberg to migrate its compaction pipelines from batch to incremental processing models and solve these latency and compute problems. We also show how that leads to an improvement in overall cluster resource utilization and a more uniform workload distribution. Finally, we will cover how we optimize compaction and data deduplication in light of late data.

Come join us at Cloudera's office in Santa Clara. Food and drinks will be served, and there will be some cool swag for in-person attendees (limited supply; first come, first served)! We will also have dedicated time for socializing and networking both shortly before and after the technical talks.

We have a capacity limit for folks attending in person, so be sure to pre-register (registration will close promptly on Dec 5th). You will be asked to "sign in" as a visitor to the building so that security won't inadvertently escort you off the premises.

If you'd like to participate but can't make it in person, many of our "sister groups" are going to be simulcasting the proceedings online. Choose the link in your time zone:
