
About us
Stream processing / real-time event processing is everywhere. This group's goal is to showcase some of the cutting-edge developments happening in stream processing in the industry. The focus of the meetup will be Apache Kafka, Apache Samza, Apache Flink, Change Data Capture, Lambda/Kappa architecture, and the like. Hosted by LinkedIn.
Past meetup talks are available at https://www.youtube.com/playlist?list=PLZDyxA22zzGx34wdHESUux2_V1qfkQ8zx
Upcoming events
[In-Person + Online] Stream Processing with Apache Kafka, Samza, and Flink
- Venue: Building 4, Together (Meeting Room), 800 E Middlefield Rd, Mountain View, CA 94043
- Zoom: https://linkedin.zoom.us/j/97461945938
5:30 - 6:00: Networking [in-person only + catered food]
6:00 - 6:05: Welcome
6:05 - 6:40: Operating Postgres Change Data Capture at Massive Scale
Sai Srirampur, Director of Product, ClickHouse
ClickHouse is a large-scale user of Postgres Change Data Capture through ClickPipes, continuously streaming over 200 TB of data per month from Postgres across 300+ customers. Some individual customer environments exceed 60 TB, moving tens of terabytes monthly via CDC-driven ingestion. In this talk, we'll share our journey of scaling Postgres CDC to this level: the Postgres-specific logical replication optimizations, the operational challenges and lessons learned along the way, and the hundreds of edge cases we refined to make CDC truly production-ready at massive scale.
We’ll take you behind the scenes of how we operate Postgres CDC day-to-day: deploying fleets of ingestion clusters through Kubernetes, executing non-disruptive upgrades across hundreds of tenants, and building end-to-end observability, alerting, and introspection around CDC pipelines—giving both our teams and our customers deep visibility into lag, throughput, and system reliability.
Attendees will gain practical insights into what it takes to run enterprise-grade Change Data Capture from Postgres at global scale.
- Sai is a Director at ClickHouse, where he leads all Postgres efforts. Prior to ClickHouse, he was the CEO and Co-founder of PeerDB, a database replication company which was acquired by ClickHouse. Before that, he was a leader on the Microsoft Postgres team and an early engineer at Citus Data.
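As background, here is a minimal, hypothetical sketch (not ClickPipes' actual implementation) of the Postgres logical-replication primitives such CDC pipelines build on, using plain JDBC and the built-in test_decoding plugin; the database, table, publication, and slot names are made up:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class LogicalReplicationSketch {
    public static void main(String[] args) throws Exception {
        // Requires wal_level = logical on the Postgres server.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app", "postgres", "secret");
             Statement st = conn.createStatement()) {

            // 1. Publish the tables whose changes we want to capture.
            st.execute("CREATE PUBLICATION cdc_pub FOR TABLE orders");

            // 2. Create a logical replication slot; Postgres retains WAL for
            //    the slot until the consumer confirms it has read the changes.
            st.execute("SELECT pg_create_logical_replication_slot(" +
                       "'cdc_slot', 'test_decoding')");

            // 3. Poll decoded changes from the slot (illustration only).
            try (ResultSet rs = st.executeQuery(
                    "SELECT lsn, data FROM pg_logical_slot_get_changes(" +
                    "'cdc_slot', NULL, NULL)")) {
                while (rs.next()) {
                    System.out.println(rs.getString("lsn") + " " + rs.getString("data"));
                }
            }
        }
    }
}
```

A production pipeline would consume the streaming replication protocol (e.g., the pgoutput plugin) rather than polling like this, and would monitor slot lag closely, since an unconsumed slot forces Postgres to retain WAL indefinitely.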
6:40 - 7:15: Kafka Consumer QoS: Lag-Sorted Assignment + Partition Priority Filtering
Sid Anand, Raghu Baddam, Bali Singh, Ravinder Matte, Walmart
Standard Kafka partition assignors (Range, Round Robin, Sticky) distribute partitions evenly by partition count but ignore message lag. This creates a common problem: some consumers drain small lags quickly while others get stuck on high-lag partitions, slowing overall throughput and recovery time after traffic spikes or outages. There is also no built-in way to concentrate processing on selected partitions or skip troubled ones without changing topics. This talk covers three assignment strategies (with open-source potential) that address this gap.
- The lag-based assignor calculates per-partition lag during rebalance and sorts partitions by lag before assignment, ensuring that multiple high-lag partitions don't land on a single consumer instance, so high-lag partitions are drained faster (see the sketch after this list).
- The priority partition filter assignor adds include/exclude filtering (with range support) to focus consumers on high-priority partitions or bypass stuck ones (poison-pill cases) without topic reconfiguration.
- The weighted priority assignor assigns proportionally more consumer capacity (QoS) to selected partitions (the high-priority group) than to other partitions (the low-priority group).
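As a rough illustration of the first strategy (a standalone sketch under assumed inputs, not Walmart's implementation, which would plug into Kafka's ConsumerPartitionAssignor interface and read lag from committed offsets): sort partitions by lag descending, then greedily hand each one to the consumer currently carrying the least total lag.

```java
import java.util.*;

public class LagSortedAssignmentSketch {
    /** Assign partitions to consumers, balancing by lag rather than by count. */
    static Map<String, List<Integer>> assign(Map<Integer, Long> lagByPartition,
                                             List<String> consumers) {
        // Sort partitions by lag, highest first.
        List<Integer> partitions = new ArrayList<>(lagByPartition.keySet());
        partitions.sort(Comparator.comparingLong(lagByPartition::get).reversed());

        Map<String, List<Integer>> assignment = new HashMap<>();
        Map<String, Long> load = new HashMap<>();
        for (String c : consumers) {
            assignment.put(c, new ArrayList<>());
            load.put(c, 0L);
        }
        // Greedy: each partition goes to the consumer with the least total lag,
        // so high-lag partitions spread across instances instead of piling up.
        for (int p : partitions) {
            String least = Collections.min(load.keySet(),
                    Comparator.comparingLong(load::get));
            assignment.get(least).add(p);
            load.merge(least, lagByPartition.get(p), Long::sum);
        }
        return assignment;
    }

    public static void main(String[] args) {
        Map<Integer, Long> lag = Map.of(0, 90_000L, 1, 85_000L, 2, 500L, 3, 200L);
        // The two high-lag partitions (0 and 1) end up on different consumers.
        System.out.println(assign(lag, List.of("consumer-a", "consumer-b")));
    }
}
```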
Key Takeaways
- Why round-robin assignment fails under uneven partition lag
- How to configure lag-based sorting with minimal overhead
- When to use partition filtering (include/exclude) for operational control (sketched after this list)
- Fallback strategies and compatibility with cooperative rebalancing
- Performance trade-offs: lag calculation cost vs throughput gain
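Similarly, a minimal sketch of the include/exclude idea behind the second strategy, with a hypothetical range syntax like "0-3,7"; the real assignor would apply this during rebalance rather than as a standalone helper:

```java
import java.util.*;

public class PartitionFilterSketch {
    /** Parse a spec such as "0-3,7" into the set of partition ids it covers. */
    static Set<Integer> parse(String spec) {
        Set<Integer> ids = new TreeSet<>();
        for (String part : spec.split(",")) {
            String[] range = part.trim().split("-");
            int lo = Integer.parseInt(range[0]);
            int hi = range.length > 1 ? Integer.parseInt(range[1]) : lo;
            for (int i = lo; i <= hi; i++) ids.add(i);
        }
        return ids;
    }

    /** Keep only included partitions, then drop excluded ones (e.g. poison pills). */
    static List<Integer> filter(List<Integer> partitions, String include, String exclude) {
        Set<Integer> in = include == null ? null : parse(include);
        Set<Integer> out = exclude == null ? Set.of() : parse(exclude);
        List<Integer> kept = new ArrayList<>();
        for (int p : partitions) {
            if ((in == null || in.contains(p)) && !out.contains(p)) kept.add(p);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Partitions 0..7: focus on 0-3 and 7, but skip stuck partition 2.
        System.out.println(filter(List.of(0, 1, 2, 3, 4, 5, 6, 7), "0-3,7", "2"));
    }
}
```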
- Sid Anand: Fellow, Cloud & Data Platform, Walmart Global Tech. Former Chief Architect & Head of Engineering at Datazoom and Chief Data Engineer at PayPal. Earlier roles at Netflix, LinkedIn, eBay, Etsy. BS/MS CS (Distributed Systems), Cornell. Advisor to startups and conferences; former Apache Airflow committer.
- Raghu Baddam: Senior software engineer with 10+ years in distributed systems, event streaming, and data infrastructure. I build Kafka consumer frameworks focused on latency, reliability, and operational simplicity. My work spans Java-based messaging platforms, large-scale stream processing, and tooling for production Kafka clusters.
- Bali Singh: I am a senior engineer with two decades of experience building high-throughput, low-latency distributed systems. My work focuses on architecting end-to-end data platforms, including large-scale data lakes, real-time streaming ingestion, and core eventing/messaging services.
- Ravinder Matte: Software Engineering Director experienced in large-scale system design and senior team mentorship. Technologies: Scala, Java, Cassandra, Storm, Spark, Kafka, distributed databases.
7:15 - 7:50: Powering Stateful Joins at Scale with Flink SQL at LinkedIn
Weiqing Yang, Pravesh Gupta, LinkedIn
At LinkedIn, we operate real-time pipelines that join client- and server-side events to power downstream analytics and metrics. These stateful joins are business-critical across products like Notifications, Search, Jobs, and Premium. In this session, we'll share how we built a scalable Flink SQL-based solution to support these complex joins. As part of our platform evolution, we migrated over a dozen production pipelines from Samza to Flink, unlocking richer SQL capabilities, streamlined job development, and unified platform support. The migration also enabled us to decommission legacy Samza pipelines and Couchbase stores, resulting in significant cost savings.
Key takeaways include:
- Ensuring reliability at scale through state compatibility, SQL tuning, and runtime stability
- Diagnosing Join operator I/O bottlenecks and developing a resource consumption formula, leading to 80% hardware cost savings
- Building platform tooling such as topic-level startpoint configuration and capacity readiness checks
- Automating reconciliation and backfill workflows with minimal user input
- Introducing an end-to-end testing framework tailored for event join validation
We'll also share lessons learned along the way and the benefits of adopting a declarative, SQL-first model for managing large-scale stream processing.
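For a flavor of the kind of join involved, here is a self-contained sketch of a Flink SQL interval join between client- and server-side events via the Table API; the topics, schemas, and the five-minute join window are illustrative assumptions, not LinkedIn's actual pipeline definitions:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ClientServerJoinSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.inStreamingMode());

        // Hypothetical Kafka-backed sources for client- and server-side events.
        tEnv.executeSql(
            "CREATE TABLE client_events (" +
            "  request_id STRING, page STRING, ts TIMESTAMP(3)," +
            "  WATERMARK FOR ts AS ts - INTERVAL '10' SECOND" +
            ") WITH ('connector' = 'kafka', 'topic' = 'client-events'," +
            "        'properties.bootstrap.servers' = 'localhost:9092'," +
            "        'format' = 'json', 'scan.startup.mode' = 'latest-offset')");

        tEnv.executeSql(
            "CREATE TABLE server_events (" +
            "  request_id STRING, status INT, ts TIMESTAMP(3)," +
            "  WATERMARK FOR ts AS ts - INTERVAL '10' SECOND" +
            ") WITH ('connector' = 'kafka', 'topic' = 'server-events'," +
            "        'properties.bootstrap.servers' = 'localhost:9092'," +
            "        'format' = 'json', 'scan.startup.mode' = 'latest-offset')");

        // Interval join: each side's state is retained only for the join
        // window, which is what keeps this kind of pipeline's state bounded.
        tEnv.executeSql(
            "SELECT c.request_id, c.page, s.status, c.ts AS client_ts " +
            "FROM client_events c JOIN server_events s " +
            "  ON c.request_id = s.request_id " +
            " AND s.ts BETWEEN c.ts - INTERVAL '5' MINUTE " +
            "               AND c.ts + INTERVAL '5' MINUTE")
            .print();
    }
}
```

The interval bound is what keeps the join state finite: once the watermark passes a row's window, Flink can discard it, which is central to running such joins reliably at scale.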
- Weiqing has been working in big data computation frameworks since 2015. She is currently a software engineer on the Streaming Infra team at LinkedIn, working on Flink, Samza, K8s, etc. Before that, she worked on Spark at Hortonworks. Weiqing is an active member of the tech community and a frequent conference speaker, having presented at Spark Summit, HBaseCon, KubeCon + CloudNativeCon North America, ApacheCon, Flink Forward, and Current.
- Pravesh is a Tech Lead at LinkedIn specializing in real-time distributed systems at scale. Since 2016, he has been architecting high-throughput data pipelines and event-driven systems. His notable work includes building a real-time OLAP analytical platform on Apache Druid from scratch. Currently, he leads the Tracking Infrastructure team at LinkedIn, designing systems that process billions of events with low latency and high reliability. He's deeply interested in stream processing, data infrastructure, and tackling the unique challenges of operating distributed systems at LinkedIn scale.