Open Lakehouse meetup (ft. Apache Iceberg): Building Scalable Data Platforms
## Open Lakehouse Meetup (ft. Apache Iceberg): Building Scalable Data Platforms
Date: 20th December 2025
Time: 1:30 PM – 5:00 PM IST
Venue: Cloudera Bangalore Office
In Collaboration With: Datazip
Registration: RSVP to this event to confirm your attendance
***
### About the Event
Join us for a power-packed Open Lakehouse Meetup featuring Apache Iceberg, where industry experts will deep-dive into modern lakehouse architectures, Iceberg innovations, Python-based data engineering, and high-performance query engines.
Whether you’re a data engineer, platform architect, or analytics professional, this meetup will equip you with hands-on insights into building scalable, cost-effective, and future-proof data platforms using Apache Iceberg and open-source technologies.
***
## 🗓️ Event Agenda
### ✅ 1:00 – 1:30 PM | Registration & Networking
Sign up, settle in, and network with fellow data practitioners.
***
### 🎤 1:30 – 2:00 PM | Talk 1
Apache Iceberg: Key Innovations So Far & What’s Next for Developers in v4
Speaker: Dipankar Mazumdar, Director – Developer Relations, Cloudera
Discover how Iceberg evolved from v1 → v2 → v3, and what’s coming next in Spec v4.
Key takeaways include:
- Row lineage & deletion vectors
- New statistics models
- Metadata redesign
- File Format API & engine interoperability
- Future roadmap for Iceberg developers
***
### 🎤 2:00 – 2:30 PM | Talk 2
Simple Data Engineering Without a Cluster – A Tour of PyIceberg & PyArrow
Speaker: Diptiman Raichaudhuri, Staff Developer Advocate, Confluent
Apache Iceberg has become the de-facto Open Table Format for modern lakehouses. With PyIceberg and PyArrow, data engineers can now perform pure Python CRUD operations and transformations on Iceberg tables—without needing heavy distributed compute engines.
This session will cover:
- Pythonic CRUD on Iceberg tables
- In-memory transformations using PyArrow
- Interoperability with DuckDB, Pandas & Apache Arrow
- Building lightweight, fast ETL pipelines without clusters
***
### 🎤 2:30 – 3:00 PM | Talk 3
OLake: Solving Modern Data Ingestion Challenges with Apache Iceberg and Arrow
Speakers: Ankit Sharma & Badal Singh, Software Engineers, OLake
As data ingestion pipelines scale, Apache Iceberg's default write strategies reveal critical bottlenecks:
- Metadata bloating from parallel writes
- File conflicts between concurrent writers
- Inconsistent file sizing
- Degraded query performance

This talk explores how OLake addresses these production challenges while introducing a columnar-first architecture powered by Apache Arrow.
***
### 🎤 3:00 – 3:30 PM | Talk 4
Apache Iceberg with Dataproc Lightning Engine
Speakers: Vishal Karve & Haymant Mangla, Software Engineers, Google
As organizations explore alternatives to JVM-based Spark for better price-performance, this session dives into next-gen execution engines like DataFusion, Velox, and Comet—and their integration with Apache Iceberg. You’ll also learn about the ongoing work in Dataproc Lightning Engine to enable seamless Iceberg interoperability.
***
### 🎤 3:30 – 4:00 PM | Talk 5
Breaking Down Silos: Building an Open, Zero-Copy Data Mesh with the Iceberg REST Catalog
Speaker: Akshat Mathur, Product Manager, Cloudera
Data interoperability is the backbone of a successful data mesh, yet traditional catalogs often act as walled gardens. To build a truly composable data platform, we need a standard that prioritizes openness and secure access over storage location.
Join us as we dive into the Iceberg REST Catalog specification, the open standard that is redefining data access. We will demonstrate how the REST Catalog acts as a gateway, facilitating seamless interoperability between diverse compute engines and enabling a Zero-Copy model where "sharing" replaces "copying" (ETL).
We will cover:
- The Interoperability Problem: Moving beyond the limitations of the Hive Metastore and file-system catalogs.
- Zero-Copy Architecture: Using the REST protocol to vend access tokens dynamically, allowing secure, ephemeral access to data in place.
- The Open Ecosystem: A look at how Cloudera is implementing this standard to democratize data access.
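
For attendees who want a feel for the mechanism before the session, here is a minimal sketch of what a "load table" call against an Iceberg REST Catalog might look like. It only builds the request and does not send it. The base URL, namespace, table name, and token are hypothetical placeholders, and the helper `build_load_table_request` is invented for illustration. The sketch assumes the catalog follows the open REST specification's `/v1/namespaces/{namespace}/tables/{table}` route, joins multi-level namespaces with the `0x1F` unit separator, and honors the `X-Iceberg-Access-Delegation` header. It is not taken from the talk itself.

```python
from urllib.parse import quote


def build_load_table_request(base_url, namespace, table, token):
    """Return (url, headers) for a REST catalog 'load table' call.

    Hypothetical helper for illustration; nothing is sent over the network.
    """
    # Assumption: multi-level namespaces are joined with the 0x1F unit
    # separator and then URL-encoded, as in the open REST specification.
    encoded_namespace = quote("\x1f".join(namespace), safe="")
    url = "/".join(
        [
            base_url.rstrip("/"),
            "v1",
            "namespaces",
            encoded_namespace,
            "tables",
            quote(table, safe=""),
        ]
    )
    headers = {
        # Bearer token identifying the caller (placeholder value below).
        "Authorization": f"Bearer {token}",
        # Asks the catalog to hand back short-lived storage credentials
        # instead of requiring the client to hold long-lived ones.
        "X-Iceberg-Access-Delegation": "vended-credentials",
    }
    return url, headers


# Placeholder catalog endpoint, namespace, table, and token.
url, headers = build_load_table_request(
    "https://catalog.example.com/api",
    ["sales", "emea"],
    "orders",
    token="example-token",
)
print(url)
print(headers["X-Iceberg-Access-Delegation"])
```

Any HTTP client can send the resulting request. A catalog that supports credential vending would then be expected to return the table metadata location together with temporary credentials, so the engine reads the data files in place rather than copying them.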
***
### ☕ 4:00 – 5:00 PM | Networking & Snacks
Wrap up the day with great conversations, food, and community networking.
***
## 🎯 Who Should Attend?
- Data Engineers
- Analytics Engineers
- Platform Architects
- Cloud Engineers
- Open Source & Lakehouse Enthusiasts
***
## 🌟 Attendee Benefits & Key Takeaways
- Run pure Python PyIceberg CRUD operations without clusters
- Understand how Iceberg specs evolved across versions
- Learn the real-world impact of Iceberg v3 features
- Preview upcoming innovations in Iceberg v4
- See a complete open-source lakehouse stack in action
***
## 🎤 Speaker Bios (Highlights)
Diptiman Raichaudhuri – Staff Developer Advocate at Confluent, with deep expertise across Kafka, Flink, Spark, Iceberg, DuckDB, LLMs, and large-scale cloud platforms.
Vishal Karve & Haymant Mangla – Software Engineers on the Google Cloud Dataproc team, focused on Apache Iceberg performance and optimization.
Dipankar Mazumdar – Director of Developer Advocacy at Cloudera, contributor to Apache Iceberg, author of multiple lakehouse publications, and speaker at global data conferences.
Akshat Mathur – Product Manager for Cloudera Open Data Lakehouse and contributor to open-source projects such as Apache Hive and Apache Tez.
Ankit Sharma – Lead and founding engineer at Datazip and a core part of the execution behind OLake. He is currently building compaction for small Parquet files and is passionate about solving data engineering problems and contributing to open source.
Badal Singh – Software engineer working on OLake, focused on building ingestion pipelines using Apache Arrow. He is an Apache Iceberg Go contributor.

