Spark on Kubernetes, and State Management in Structured Streaming


Details
In this meetup, we will cover:
a) Kubernetes Native Integration with Spark introduced in Spark 2.3.0
b) Deep Dive into State Management in Structured Streaming
Agenda:
09:00 AM - 10:00 AM - Registration, Welcome Note
10:00 AM - 12:15 PM - Spark on Kubernetes, by Madhukara Phatak, Director of Engineering at Tellius (includes a 15-minute break from 11:00 AM - 11:15 AM)
12:15 PM - 12:30 PM - Short Break
12:30 PM - 01:30 PM - Understanding State Management in Structured Streaming, by Chandan Prakash, Data Engineer at Qubole
01:30 PM - 02:30 PM - Lunch and Networking
Abstracts:
a) Kubernetes Native Integration with Spark introduced in Spark 2.3.0
If you are new to Kubernetes, we recommend watching this video from the previous meetup: https://www.youtube.com/watch?v=Q0miRvKA4yk. It serves as a pre-read for the upcoming session, which covers the native integration of Kubernetes with Spark.
b) Deep Dive into State Management in Structured Streaming
Stateful stream processing must manage the state of intermediate data for operations such as aggregation, group-by, and de-duplication. Structured Streaming, the new SQL-based stream processing engine in Spark, takes a different and more efficient approach to managing state than the older DStream-based Spark Streaming. In this talk we will discuss in detail:
Architecture of the new state management in Structured Streaming
Comparison with state management in the older DStream-based Spark Streaming
Deep dive into the streaming code to understand how state management works in Structured Streaming, with a demo example
