
Details

CDC pipelines into Apache Iceberg are straightforward to start and surprisingly hard to maintain at scale. The moment you move beyond full loads into continuous CDC, you accumulate equality delete files, small Parquet fragments, and schema drift that silently degrade query performance over time. This workshop is built around that exact problem.

The session walks through the full ingestion path: configuring OLake to replicate a PostgreSQL source into S3-backed Iceberg tables, querying that data live with Trino via Starburst, and then confronting the real operational challenges that appear after the first few CDC cycles. Small files, delete file overhead, snapshot history, and compaction are all on the table.
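To make the "small files and delete files" problem concrete: Trino's Iceberg connector exposes a `$files` metadata table that shows every data and delete file backing a table. A query like the following (catalog, schema, and table names here are placeholders, not from the session materials) is the kind of inspection the workshop revolves around:

```sql
-- Inspect the physical layout of a CDC-ingested Iceberg table.
-- content: 0 = data file, 1 = position delete file, 2 = equality delete file
SELECT
  content,
  file_path,
  record_count,
  file_size_in_bytes
FROM iceberg.cdc_demo."orders$files"
ORDER BY content, file_size_in_bytes;
```

After a few incremental syncs you would typically see many small data files plus equality delete files (content = 2), which is exactly the merge-on-read overhead the session digs into.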

Here is what the agenda covers:

[To be covered by Nayan]

- Configuring OLake for PostgreSQL replication: source config, catalog discovery, and sync modes
- Running the sync and seeing how OLake writes Parquet files into S3 as Iceberg tables after a full load
- CDC in practice: running incremental syncs and inspecting the resulting small data files and equality delete files
- Why file accumulation matters: merge-on-read (MOR) overhead, and what the table layout looks like before any maintenance
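For a rough sense of what the first step involves, an OLake PostgreSQL source config is a small JSON document with connection details and a CDC replication setting. The keys below are an illustrative sketch only; the exact schema depends on your OLake version, so treat every field name as an assumption and check the OLake documentation:

```json
{
  "host": "postgres.example.com",
  "port": 5432,
  "database": "appdb",
  "username": "replicator",
  "password": "********",
  "update_method": {
    "replication_slot": "olake_slot"
  }
}
```

The replication slot is what lets CDC syncs pick up changes from PostgreSQL's WAL instead of re-reading full tables.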

[To be covered by Lester]

- Querying with Starburst: connecting to Iceberg, validating ingested data, and running live queries
- Iceberg snapshot history and time travel: querying previous table states using snapshot IDs
- Compaction with Starburst: merging small files, reducing delete overhead, and measuring the actual performance difference
- Best practices for Iceberg table maintenance in CDC pipelines
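The snapshot, time-travel, and compaction steps above map onto standard Trino Iceberg connector SQL. A minimal sketch, with placeholder catalog/schema/table names and a placeholder snapshot ID:

```sql
-- List snapshots to find an ID for time travel
SELECT snapshot_id, committed_at, operation
FROM iceberg.cdc_demo."orders$snapshots"
ORDER BY committed_at;

-- Query the table as of an earlier snapshot (ID is a placeholder)
SELECT count(*)
FROM iceberg.cdc_demo.orders FOR VERSION AS OF 1234567890123456789;

-- Compact small files; rewritten data no longer needs its delete files at read time
ALTER TABLE iceberg.cdc_demo.orders
EXECUTE optimize(file_size_threshold => '128MB');

-- Optionally expire old snapshots afterwards to reclaim storage
ALTER TABLE iceberg.cdc_demo.orders
EXECUTE expire_snapshots(retention_threshold => '7d');
```

Comparing query times before and after `optimize` is the simplest way to measure the performance difference the session demonstrates.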

Bring a laptop if you want to follow along. The address will be announced closer to the date, so keep an eye on updates to this event.

Related topics

Big Data
Data Engineering
Database Professionals
Data Lakes
