What we're about

This is a group for anyone who wants to learn about Apache Flink or share what they know about it. We are excited about its potential, and we want to find other people who are interested. Apache Flink is a 'streaming-first' data processing engine - which is a lot cooler than it sounds. See here for more: https://flink.apache.org/

Upcoming events (1)

Streaming from Iceberg Data Lake & Multi Cluster Kafka Source


Streaming from Iceberg Data Lake
Steven Zhen Wu, Apple

Apache Iceberg brings numerous benefits such as snapshot isolation, transactional commits, fast scan planning, and time travel support. These features solved important correctness and performance challenges for batch processing use cases. While originally adopted for batch, Iceberg can also be leveraged as a streaming source. Compared to periodically scheduled batch ETL jobs, streaming reads can reduce processing delay from hours to minutes.

In this talk, we are going to discuss how the Flink Iceberg source enables streaming reads from Iceberg tables, where long-running Flink jobs continuously poll the table and process data as soon as it is committed. We will discuss the design of the source operator, focusing in particular on the streaming read mode. We will compare the Kafka and Iceberg sources for streaming reads and discuss how the Iceberg streaming source can power common stream processing use cases. Finally, we will present performance evaluation results for Iceberg streaming reads.
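As a rough sketch of the kind of streaming read the talk covers, here is what wiring up the FLIP-27 Iceberg source in streaming mode can look like. The table path, poll interval, and starting strategy below are illustrative placeholders, not details from the talk:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.source.IcebergSource;
import org.apache.iceberg.flink.source.StreamingStartingStrategy;

public class IcebergStreamingReadSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder table location; point this at a real Iceberg table.
        TableLoader tableLoader =
                TableLoader.fromHadoopTable("hdfs://namenode:8020/warehouse/db/events");

        // streaming(true) keeps the job running: the source periodically
        // discovers newly committed snapshots and hands their files to readers,
        // instead of doing a one-shot batch scan.
        IcebergSource<RowData> source = IcebergSource.forRowData()
                .tableLoader(tableLoader)
                .streaming(true)
                .monitorInterval(Duration.ofSeconds(30))  // placeholder poll interval
                .streamingStartingStrategy(
                        StreamingStartingStrategy.INCREMENTAL_FROM_LATEST_SNAPSHOT)
                .build();

        DataStream<RowData> stream = env.fromSource(
                source, WatermarkStrategy.noWatermarks(), "iceberg-source");
        stream.print();
        env.execute("iceberg-streaming-read");
    }
}
```

Running this requires the `iceberg-flink-runtime` connector on the classpath and an existing Iceberg table; the monitor interval controls how quickly new commits become visible to the job.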

Multi Cluster Kafka Source
Mason Chen, Apple

Flink consumers read from Kafka as a scalable, high-throughput, low-latency data source. However, scaling out data streams becomes challenging when multiple Kafka clusters and cluster migrations are involved. Thus, we introduced a new Kafka source that reads sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure.

In this presentation, we will cover the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms for reading logical streams located on multiple clusters, dynamically adapting to infrastructure changes, and performing transparent cluster migrations and failover.
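For context, the existing KafkaSource that the talk's multi-cluster work extends is bound to a single cluster. A minimal single-cluster setup looks roughly like this (the broker address, topic, and group id are placeholders); the extension described in the talk generalizes this so one logical stream can span several clusters:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Standard FLIP-27 KafkaSource, tied to one cluster's bootstrap servers.
        // The multi-cluster source discussed in the talk generalizes split
        // enumeration so the same logical stream can be read across clusters
        // and survive cluster migrations without redeploying the job.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("broker-1:9092")   // placeholder address
                .setTopics("events")                    // placeholder topic
                .setGroupId("flink-meetup-demo")        // placeholder group id
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> stream = env.fromSource(
                source, WatermarkStrategy.noWatermarks(), "kafka-source");
        stream.print();
        env.execute("kafka-streaming-read");
    }
}
```

Running this sketch requires the `flink-connector-kafka` dependency and a reachable Kafka broker.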

