Webinar: Distributed Stream Processing in Practice [Scalable, Real-time Data Pipelines]


Details
About the Event
This technical session examines real-world challenges and patterns in building distributed stream processing systems. We focus on scalability, fault tolerance, and latency trade-offs through a concrete case study, using frameworks such as Apache Storm to illustrate production concepts.
Why Should You Attend
Learn practical patterns for distributed stream processing at scale:
- Master real-world challenges - Understand scalability, fault tolerance, and latency trade-offs in production
- See architectural patterns - Stateless vs. stateful processing, event time vs. processing time decisions
- Handle scale bottlenecks - Partitioning strategies, backpressure handling, and scheduling challenges
- Learn from concrete examples - Real ML feature generation pipeline using Storm and Kafka
Perfect for: Data engineers building distributed streaming systems who need production-proven patterns.
------------------------------------------------------------
Agenda (approx. 30 minutes)
1. Stream Processing: Past and Present (4 minutes)
- Rise of real-time data needs in ML, analytics, and user-facing apps
- Shift from batch-first to event-first architectures
2. Distributed Stream Processing Fundamentals (5 minutes)
- Definition and fundamentals
- Processing guarantees: at-most-once, at-least-once, exactly-once (see the commit-ordering sketch after this section)
- Batch vs. micro-batch vs. true streaming
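
For illustration, a minimal Java sketch of how commit ordering determines the guarantee, assuming the standard kafka-clients consumer API; the broker address, group id, and topic name are placeholders for this example. Committing offsets after processing gives at-least-once (records may be reprocessed after a crash); committing before processing would give at-most-once.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class AtLeastOnceConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // placeholder broker
            props.put("group.id", "feature-pipeline");         // placeholder consumer group
            props.put("enable.auto.commit", "false");          // commit manually to control semantics
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("user-events"));     // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record);    // may run again after a crash, so duplicates are possible
                    }
                    consumer.commitSync();  // commit AFTER processing => at-least-once
                    // committing BEFORE processing instead would give at-most-once
                }
            }
        }

        private static void process(ConsumerRecord<String, String> record) {
            System.out.printf("offset=%d key=%s value=%s%n",
                    record.offset(), record.key(), record.value());
        }
    }

Exactly-once typically needs extra machinery on top of this, such as idempotent sinks or transactional writes, which is where stream processing frameworks come in.
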
3. Architectural Patterns (6 minutes)
- Stateless vs. stateful processing (illustrated in the sketch after this section)
- Event time vs. processing time
- Schedulers
Common architecture: Kafka → Stream Processor → Sink (DB, Lake, Dashboard)
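
As a rough, framework-free Java illustration (the Event record and its fields are invented for this example), the stateless/stateful split comes down to whether an operator keeps anything between events; the event's own timestamp versus the wall clock at processing is the event-time/processing-time choice.

    import java.util.HashMap;
    import java.util.Map;

    public class StatelessVsStateful {
        // Hypothetical event shape; eventTimeMillis is when the event happened (event time),
        // as opposed to when an operator sees it (processing time).
        record Event(String userId, double amount, long eventTimeMillis) {}

        // Stateless: output depends only on the current event, so instances scale out freely.
        static String enrich(Event e) {
            return e.userId() + ":" + (e.amount() > 100 ? "large" : "small");
        }

        // Stateful: keeps a running aggregate per key, so all events for a key must be
        // routed to the same instance and this state must survive failures.
        static final Map<String, Double> runningTotals = new HashMap<>();

        static double addToTotal(Event e) {
            return runningTotals.merge(e.userId(), e.amount(), Double::sum);
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            Event e1 = new Event("u1", 120.0, now);
            Event e2 = new Event("u1", 30.0, now);
            System.out.println(enrich(e1));      // stateless: u1:large
            System.out.println(addToTotal(e1));  // stateful:  120.0
            System.out.println(addToTotal(e2));  // stateful:  150.0
        }
    }

In the Kafka → Stream Processor → Sink shape above, it is the stateful operators that drive the partitioning, checkpointing, and scheduling decisions covered next.
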
4. Designing for Scale (6 minutes)
- Partitioning strategies and operator parallelism
- Handling backpressure and traffic spikes (see the sketch after this section)
- Scheduling challenges and system bottlenecks
- Fault tolerance and availability
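
A simplified sketch of the core idea behind backpressure (plain Java, no framework): a bounded buffer between two stages makes a fast producer block instead of queueing unbounded work when the downstream stage falls behind, which is roughly what the queue or credit limits inside a real stream processor enforce.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class BackpressureSketch {
        public static void main(String[] args) throws InterruptedException {
            // Bounded buffer between two pipeline stages; the capacity stands in for
            // whatever queue or credit limit a real stream processor uses.
            BlockingQueue<String> buffer = new ArrayBlockingQueue<>(100);

            Thread producer = new Thread(() -> {
                for (int i = 0; i < 10_000; i++) {
                    try {
                        // put() blocks while the buffer is full, so a traffic spike
                        // slows the producer down instead of exhausting memory.
                        buffer.put("event-" + i);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });

            Thread consumer = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        String event = buffer.take();
                        Thread.sleep(1);  // simulate a slower downstream operator
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });

            producer.start();
            consumer.start();
            producer.join();      // producer finishes only as fast as the consumer allows
            consumer.interrupt(); // stop the demo consumer once production is done
        }
    }
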
5. Case Study: Real-Time ML Feature Generation (10 minutes)
- Event Source (Kafka): Collects user events
- Stream Engine (Apache Storm): Processes and transforms streams
- Storage (S3): Stores aggregated feature datasets
- Setup: 1 Nimbus + 3 Workers distributed topology (see the wiring sketch below)
- Model Training: Python jobs consume features
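
Below is a hedged sketch of how the case-study topology might be wired. TopologyBuilder, KafkaSpout/KafkaSpoutConfig (from storm-kafka-client), and StormSubmitter are real Storm APIs; the broker address, topic name, parallelism numbers, and the two bolt implementations are placeholders invented for this example, assuming events are keyed in Kafka by user id.

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.kafka.spout.KafkaSpout;
    import org.apache.storm.kafka.spout.KafkaSpoutConfig;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class FeatureTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // Spout: reads raw user events from Kafka (placeholder broker and topic names).
            KafkaSpoutConfig<String, String> spoutConfig =
                    KafkaSpoutConfig.builder("kafka:9092", "user-events").build();
            builder.setSpout("events", new KafkaSpout<>(spoutConfig), 3);

            // Aggregation bolt: fieldsGrouping on the Kafka message key (assumed to be the
            // user id) keeps each user's events on the same task, so per-user state stays consistent.
            builder.setBolt("aggregate", new AggregateFeaturesBolt(), 6)
                   .fieldsGrouping("events", new Fields("key"));

            // Writer bolt: persists aggregated feature rows to S3 for the Python training jobs.
            builder.setBolt("s3-writer", new S3WriterBolt(), 3)
                   .shuffleGrouping("aggregate");

            Config conf = new Config();
            conf.setNumWorkers(3);  // matches the 1 Nimbus + 3 worker setup described above

            StormSubmitter.submitTopology("feature-generation", conf, builder.createTopology());
        }

        // Placeholder bolt: a real version would compute rolling feature values per user.
        public static class AggregateFeaturesBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple input, BasicOutputCollector collector) {
                String userId = input.getStringByField("key");
                String payload = input.getStringByField("value");
                collector.emit(new Values(userId, payload.length()));  // toy "feature"
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("userId", "feature"));
            }
        }

        // Placeholder bolt: a real version would batch rows and upload them to S3.
        public static class S3WriterBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple input, BasicOutputCollector collector) {
                // e.g., buffer (userId, feature) rows and periodically flush files to S3
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // terminal bolt: no downstream streams declared
            }
        }
    }

The Python training jobs then read the feature files from S3 on their own schedule, decoupled from the streaming topology.
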
