Details

  • In-Person Venue: Building 950 - Unify (Meeting Room, 1st Floor) - 950 W Maude Ave, Sunnyvale, CA 94085 (Just show up, no registration link needed!)
  • Online Registration Link: click here (Only required if you are joining virtually via Teams)

5:30 - 6:00: Networking [in-person only + catered food]
6:00 - 6:05: Welcome
6:05 - 6:40: Forkable Shared Logs: Enabling Testing, Analysis, and Agentic Workloads on Real-Time Data
Shreesha G. Bhat, Ph.D. student in Computer Science at the University of Illinois Urbana-Champaign
Streaming platforms are great at processing live data, but surprisingly poor at experimentation. Want to validate a new application on production traffic? Replay a stream with different business logic? Let multiple AI agents explore alternative actions from the same state? These workflows typically require copying data, running parallel pipelines, or maintaining separate development environments that mirror production.

This talk introduces forkable shared logs, a new abstraction that allows applications to create lightweight branches of a stream and process them independently. Much like branching in version control systems, stream forks enable developers and applications to experiment on realistic workloads without affecting production execution.

I will present AgileLog, a shared-log system designed around this abstraction, discuss the systems challenges behind efficiently forking streams, and show how forkable streams can enable testing, analysis, and emerging agentic workloads on real-time data.
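To make the branching analogy concrete, here is a minimal in-memory sketch of a forkable log. It is an illustration only, not AgileLog's design or API: the `ForkableLog` class and its methods are hypothetical names, and the toy assumes a single process with no durability, ordering protocol, or concurrency. A fork records how many of its parent's records it can see and stores only its own appends, so creating a branch copies no data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ForkableLog:
    """Toy log (hypothetical): a fork shares its parent's prefix instead of copying it."""
    parent: Optional["ForkableLog"] = None
    fork_point: int = 0  # number of parent records visible to this branch
    records: List[str] = field(default_factory=list)  # appends made on this branch only

    def __len__(self) -> int:
        return self.fork_point + len(self.records)

    def append(self, record: str) -> int:
        """Append to this branch only and return the record's position."""
        self.records.append(record)
        return len(self) - 1

    def read(self, position: int) -> str:
        """Read by position, delegating to the parent for the shared prefix."""
        if position < 0 or position >= len(self):
            raise IndexError(position)
        if position < self.fork_point:
            return self.parent.read(position)
        return self.records[position - self.fork_point]

    def fork(self) -> "ForkableLog":
        """Create a branch that sees everything appended so far; no records are copied."""
        return ForkableLog(parent=self, fork_point=len(self))


# Usage: the production log and an experimental fork diverge after the fork point.
prod = ForkableLog()
prod.append("order-1")
prod.append("order-2")
experiment = prod.fork()
experiment.append("synthetic-order")  # visible only on the fork
prod.append("order-3")                # visible only on production
```

In this sketch, both branches return the same records for positions 0 and 1, while position 2 differs per branch, which is the isolation property the abstract describes.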

  • Shreesha G. Bhat is a Ph.D. student in Computer Science at the University of Illinois Urbana-Champaign, advised by Aishwarya Ganesan and Ram Alagappan. His research focuses on distributed storage systems, with an emphasis on distributed shared logs and data streaming. His recent work includes SpecLog (OSDI'25) and LazyLog (SOSP'24 Best Paper Award). His work has appeared at top systems conferences such as OSDI, SOSP, and EuroSys, received a Best Paper Award, and been invited to journals. Ideas from LazyLog have also seen early industry adoption, influencing WarpStream's Lightning Topics feature. Shreesha received his Dual Degree (B.Tech. + M.Tech.) in Computer Science from IIT Madras.

6:40 - 7:15: Modernizing Flink Jobs at Scale: A Platform Approach to Flink 2 Upgrade
Daniel Trager, Mark Cho, Netflix
At Netflix, we operate over 25,000 Apache Flink jobs, processing more than 100 PB of data per day across use cases as varied as data movement, personalization, ML feature platform, and ads. That diversity shows up in wildly different job characteristics: Flink jobs running at parallelism from 1 to 3,000, state from near-zero to multiple TBs, and job graphs ranging from trivial to deeply complex.

That same diversity is what makes version upgrades complex. Every breaking change in Apache Flink 2 (incompatible state, connector interface changes, dependency conflicts, changed defaults) is manageable for a single job. Multiplied across tens of thousands of jobs spread over many teams, each upgrade presents countless failure cases that must be handled either by the platform or by the users.

This talk is a platform-engineering case study in absorbing that complexity so our users don't have to. The best migration is the one you shrink, automate, and make users barely notice. We'll cover how we're approaching the Flink 2 upgrade, the tooling we've been building to make it a paved path, and how AI makes it possible to migrate jobs for our users at fleet scale.

  • Daniel Trager has been part of the Data Movement Engines team at Netflix for almost 2 years, working on improving the stream processing platform. His recent work has involved automating the deprecation of the legacy Kafka Source across 50+ Flink repositories using AI and automated state bootstrapping for Flink jobs.
  • Mark Cho is an engineer on the Data Movement Engines team, where he has spent the past 8 years building and leading Netflix's Flink platform that powers every Flink job across the company. His contributions span the entire stack: Netflix's Apache Flink fork, the nfflink SDK, build tooling, and the Flink control plane. Mark has led multiple Flink upgrades at Netflix since the Apache Flink 1.4 era through to today's Apache Flink 2.2 initiative.

7:15 - 7:50: Flink Issues Classification Engine
Manan Chandra, Stuart Tsao, Ankitha Gavinolla, Michael Barskii, LinkedIn
At LinkedIn, we operate a stream processing platform that powers hundreds of business-critical pipelines across the company. An enterprise Flink pipeline sits on top of a deep stack of dependencies. This includes core dependencies such as compute (K8s), IO (source and sink Kafka topics), and state store (Ambry), as well as other external services and data stores. Thus, debugging a pipeline failure requires looking at multiple logs and metrics, even for an experienced Flink platform engineer. Pipeline owners are product experts with minimal knowledge of Flink and other infra components. Therefore, any failure ends up escalating to a platform on-call, which is not scalable when there is just one platform team supporting hundreds of product teams.

In this session, we'll share how we simplified debugging, first by building an LLM-powered agent and later by doubling down on a deterministic (non-LLM) Issue Classification engine that encodes our on-call expertise into a decision tree. The classification tree produces error codes that let us quickly categorize a failure as a platform-level or user-level issue, avoiding duplicated on-call effort. The error code identifies the failure reason, saving hours of debugging time.

Key takeaways include:

  • Modeling on-call expertise as a decision tree that eliminates failure causes stage by stage, the same way an experienced engineer reasons through an incident, instead of leaving that knowledge locked in a few people's heads
  • Separating symptom from cause so that a checkpoint failure is traced to what is actually driving it, not just reported as a surface error
  • Surfacing the classification results inside the pipeline failure alerts sent to pipeline owners
  • Producing clear ownership upfront (platform, user code, or external dependency) and removing the guesswork of who is responsible
  • Mapping failure codes to user-executable mitigation actions
  • Aggregating failures across all pipelines to identify weak spots in the platform
  • Augmenting, not replacing, AI by letting the engine do the hard structured reasoning and using LLMs to refine results and generate human-readable summaries of the classification path
  • Building for extensibility with a tree that grows as new failure patterns emerge

We'll close with a live demo of real classification results, the lessons learned, and where we're taking the engine next.
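As a rough illustration of the decision-tree idea in the takeaways, the sketch below walks a small tree of yes/no checks and returns an error code, an owner, and the path taken. It is not LinkedIn's engine: the signal names (`pod_evicted`, `sink_unreachable`, `user_exception`), the error codes, and the `Check`, `Verdict`, and `classify` names are all hypothetical, and real signals would come from logs and metrics rather than a hand-built dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

Signals = Dict[str, bool]  # hypothetical pre-extracted facts about a failed pipeline


@dataclass(frozen=True)
class Verdict:
    code: str   # hypothetical error code
    owner: str  # "platform", "user", or "external"


@dataclass(frozen=True)
class Check:
    name: str
    test: Callable[[Signals], bool]
    if_true: Union["Check", Verdict]
    if_false: Union["Check", Verdict]


def classify(node: Union[Check, Verdict], signals: Signals) -> Tuple[Verdict, List[str]]:
    """Walk the tree deterministically, recording each check so the path can be explained."""
    path: List[str] = []
    while isinstance(node, Check):
        outcome = node.test(signals)
        path.append(f"{node.name}={outcome}")
        node = node.if_true if outcome else node.if_false
    return node, path


# Causes are eliminated stage by stage: compute first, then IO, then user code.
TREE = Check(
    "pod_evicted", lambda s: s.get("pod_evicted", False),
    Verdict("PLATFORM_COMPUTE", "platform"),
    Check(
        "sink_unreachable", lambda s: s.get("sink_unreachable", False),
        Verdict("EXTERNAL_SINK", "external"),
        Check(
            "user_exception", lambda s: s.get("user_exception", False),
            Verdict("USER_CODE", "user"),
            Verdict("UNCLASSIFIED", "platform"),
        ),
    ),
)

# A checkpoint failure (the symptom) whose underlying signal is an unreachable sink.
verdict, path = classify(TREE, {"sink_unreachable": True})
```

Because every step is an explicit check, the recorded path doubles as an explanation of the verdict, and new failure patterns can be added as new nodes without changing `classify`.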

  • Manan Chandra is an engineer on the Stream Processing team at LinkedIn, where he has contributed to various layers of the platform for running Flink jobs: k8s integration, config management, pre-prod testing, monitoring, observability, agentic debugging, and error classification. Prior to joining LinkedIn, he wrote software for diverse domains such as video streaming and healthcare.
  • Stuart Tsao is a software engineer on the Stream Processing team at LinkedIn, where he focuses on error classification and platform health at scale. His work centers on making large-scale streaming infrastructure more reliable and observable. Before LinkedIn, he worked on machine learning infrastructure.
  • Ankitha Gavinolla is a Senior Software Engineer on the Stream Processing team at LinkedIn where she has contributed to observability, autoremediation, config management, error classification and mitigation, and checkpoint management for resource efficiency. Prior to joining LinkedIn, she wrote software responsible for robotic movements in fulfillment centers at Amazon Robotics.
  • Michael Barskii is a Senior Software Engineer on LinkedIn’s Stream Processing team, where he specializes in enhancing the efficiency and reliability of Stream Processing jobs. His technical contributions span autosizing mechanisms, quota management, agentic debugging, and advanced error classification. Before his time at LinkedIn, he built core software systems for startups in the logistics and financial sectors.
