Apache Pinot Contributor Call #4
Details
Join us for our monthly Contributor Office Hours - a casual, open session to:
✅ Present your pull requests, proposals, or ideas
✅ Ask questions and get live feedback
✅ Learn how to get started contributing to Apache Pinot
Whether you’re a seasoned committer or brand new to the project, you’re welcome to join!
Hosts: Xiang Fu & Robert Zych (Apache Pinot Committers)
Agenda:
8:30 AM - 9:00 AM
Title: Eliminating Real-Time Analytics Latency Through Pauseless Consumption
Speaker: Aman Khanchandani
Summary: Streaming platforms like Kafka and Kinesis provide events with sub-second SLAs, but events alone are not enough. Apache Pinot transforms these events into actionable insights with sub-second query capabilities, enabling real-time analytics at scale.
Pinot achieves sub-second query performance by continuously ingesting and indexing data in memory. However, Pinot requires periodic persistence for durability. This creates a fundamental challenge: Pinot must pause ingestion to convert in-memory rows and indexes into persistent segments, flush them to disk, and upload them to deep store before resuming consumption. Depending on data volume and indexing complexity, these pauses can range from seconds to minutes. During these intervals, users lose access to the most recent data, creating a critical gap in real-time analytics that impacts time-sensitive decisions.
In this talk, we introduce "Pauseless Consumption" for Apache Pinot, a novel approach that eliminates ingestion pauses and improves data freshness. We'll demonstrate how this feature lets Pinot keep ingesting during the build and upload phase: new data is processed in a fresh segment while the build and upload of the older segment complete in parallel. Letting go of the sequential pause-commit-proceed strategy introduced significant challenges: preventing data loss when servers fail while segments are in transition states, managing complex segment states during parallel processing, recovering robustly from commit protocol failures, and developing timeout mechanisms for lagging servers to maintain system consistency.
Performance tests show remarkable improvements: in scale tests ingesting 300K events per second, enabling Pauseless Consumption reduces the data freshness delay from approximately 300 seconds to 5 seconds. This 60x reduction ensures that even during peak ingestion periods, your real-time data SLAs can be maintained, making true real-time analytics possible.
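To make the difference between the two strategies concrete, here is a toy model (not Pinot's commit protocol). It assumes a single consumer that ingests instantly when it is not paused, and that a segment is committed after a fixed number of rows. The function name `max_freshness_lag` and all of its parameters are hypothetical, chosen only for illustration.

```python
def max_freshness_lag(arrival_times, segment_size, commit_seconds, pauseless):
    """Worst-case delay (seconds) between an event arriving and being ingested.

    Toy model only: in sequential mode the consumer pauses for
    `commit_seconds` after every `segment_size` rows; in pauseless mode
    the commit is assumed to run in the background with no pause.
    """
    consumer_free_at = 0.0   # earliest time the consumer can ingest again
    rows_in_segment = 0
    worst_lag = 0.0
    for arrived_at in arrival_times:
        # An event waits if the consumer is still paused for a commit.
        ingested_at = max(arrived_at, consumer_free_at)
        worst_lag = max(worst_lag, ingested_at - arrived_at)
        rows_in_segment += 1
        if rows_in_segment == segment_size:
            rows_in_segment = 0
            if not pauseless:
                # Sequential pause-commit-proceed: block until commit is done.
                consumer_free_at = ingested_at + commit_seconds
    return worst_lag


# One event per second for ten seconds, a commit every 5 rows taking 3 seconds.
arrivals = list(range(10))
sequential_lag = max_freshness_lag(arrivals, 5, 3, pauseless=False)
pauseless_lag = max_freshness_lag(arrivals, 5, 3, pauseless=True)
```

In this simplified setting the sequential strategy delays the events that arrive during a commit, while the pauseless one does not. The real feature also has to handle the failure and consistency cases listed in the abstract, which this sketch ignores.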
9:00 AM - 9:30 AM
Title: Ingesting Semi-Structured JSON Data in Apache Pinot
Speaker: Xin Gao
Summary: Apache Pinot is a high-performance OLAP datastore built around static schemas for fast, predictable queries. But in the real world, data often arrives in semi-structured JSON form: dynamic, nested, and inconsistent.
This talk explores how Pinot bridges that gap: how to define flexible schema mappings, transform JSON efficiently during ingestion, and store data in a performant and efficient layout. We’ll walk through key design considerations, the core SchemaConformingTransformer, and best practices for making semi-structured data ingestion seamless in Pinot.
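As a rough illustration of the general idea of conforming semi-structured JSON to a fixed schema, the sketch below flattens a nested record, fills the columns a schema declares, and keeps everything else in a single catch-all JSON column. It is not the SchemaConformingTransformer's implementation or configuration. The helper names `flatten` and `conform`, the `extras` column, and the sample event are all assumptions made for this example.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted paths, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def conform(record, schema_columns, extras_column="extras"):
    """Map a JSON record onto a fixed set of columns (hypothetical helper).

    Fields whose flattened path matches a schema column fill that column;
    all remaining fields are serialized into one catch-all JSON column.
    """
    row = {column: None for column in schema_columns}
    extras = {}
    for path, value in flatten(record).items():
        if path in row:
            row[path] = value
        else:
            extras[path] = value
    row[extras_column] = json.dumps(extras, sort_keys=True)
    return row


event = {"user": {"id": 42, "plan": "pro"}, "action": "click", "debug": {"trace": "abc"}}
row = conform(event, ["user.id", "action", "region"])
```

Here `user.id` and `action` land in their own columns, `region` stays empty because the event does not carry it, and the unmatched fields survive in `extras` instead of being dropped.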
Want to be involved? Join the Contributors channel on Slack.
*The call will be recorded and shared on the channel afterwards.
Join the community monthly newsletter, Everything's Apache Pinot! (and get a chance to win a T-shirt!)
