What we're about

This meetup focuses on the Future of Data and the open community data projects governed by the Apache Software Foundation. It is geared towards developers, data scientists, and all data enthusiasts who are building modern data applications. Our meetups cover all data -- data-in-motion and data-at-rest. Meetups provide an opportunity to listen, share, and work hands-on with other technologists using the open source and open community Apache tools.

Upcoming events (1)

Special Hybrid Event - Apache Iceberg: Looking Below the Waterline

In the short time since its introduction, Apache Iceberg has become the most popular, fastest-growing, and most widely adopted open table format in the big data space. It addresses known big data pain points around data consistency, scalability, performance, and schema and partition evolution. In this meetup, you'll hear from the key partners in the open source community who are driving Iceberg enhancements and its roadmap.

This registration page is specifically for those who would like to attend this meetup virtually. If you would like to attend in person, please register using the link in the last paragraph below.

We have a full agenda; here's a summary of the four talks we plan to deliver:

Apache Iceberg for BI use cases
Speakers: Vincent Kulandaisamy and Shaun Ahmadian, Cloudera
This talk will cover the integration of the Iceberg open table format with the Apache Hive and Impala compute engines, support for Iceberg v1 and v2 capabilities, customer use cases, and future Iceberg enhancements and innovations in the works at Cloudera. We'll take a detailed look at the following capabilities supported in Hive and Impala:

  • Critical functional and performance enhancements
  • Materialized views support
  • In-place table migration of Hive external tables to Iceberg tables
  • Row-level update/delete
  • Table rollback
  • Table maintenance

Learn how Teranet keeps up with the growth and changing requirements of their business by using Apache Iceberg for their change data capture use case, leveraging Spark and Impala.

Multi-function Analytics with Apache Iceberg
Speaker: Wing Yew Poon, Cloudera
This session will demonstrate using Spark with Iceberg tables, highlighting key Iceberg features. We'll show the interoperability of Spark with Hive and Impala. Along the way, we'll cover Cloudera's contributions to improving Spark and Impala support for Iceberg.

Apache Iceberg's REST Catalog - Real and Potential Uses Beyond Data Workflows
Speaker: Samuel Redai, Netflix
Iceberg's new REST catalog provides a friendly access point for the rich metadata and functionality that come with an Iceberg-powered data warehouse. This makes Iceberg even easier to integrate into compute engines and makes catalog operations available from pretty much any client you can imagine. However, the power of the REST catalog doesn't stop there. A myriad of tools and features sit on the edge of the data platform and benefit greatly from the REST catalog design. In this talk, Sam will cover a few creative uses that currently exist as well as some imaginative uses that could exist.

Incremental compaction using Apache Iceberg
Speaker: Vikram Bohra, LinkedIn
At LinkedIn, streaming data in the form of Kafka topics is ingested into the data lake by low-latency ingestion pipelines powered by Apache Gobblin. This often produces small files that can contain duplicate records due to at-least-once delivery semantics, which in turn requires another set of pipelines that deduplicate data for correctness and compact it into larger files for storage and query efficiency.
Those compaction pipelines are bursty and compute-intensive, and they have higher latency due to their batch processing nature. As data volume increases, it becomes increasingly important to process data incrementally for optimal resource utilization and lower latency. In this talk, we present how LinkedIn leverages Iceberg to migrate its compaction pipelines from batch to incremental processing models and solve these latency and compute problems. We also show how that leads to an improvement in overall cluster resource utilization and a more uniform workload distribution. We will also focus on how we optimize compaction and data deduplication in light of late-arriving data.

Our local group in Santa Clara, CA is holding this event. We thought it might be of interest to our wider membership, so we are also supporting an online simulcast originating in Pacific Standard Time (the event time displayed on this page will reflect the equivalent local time). You are welcome to sign up for it here.

If you are local to the Santa Clara Valley and you'd like to attend in person, you can register using that group's registration page (registration will close promptly on Dec 5th).
