Apache Spark on Kubernetes: What Works, What Breaks, and How to Fix It
(Practical lessons from running and benchmarking real Spark lakehouse workloads)
-----------------------------------------------------------------------------------
Where is the event? COWRKS Ecoworld 4D, 10th Floor, Building 4D, ECOWORLD, Outer Ring Rd, Devarabisanahalli, Bellandur, Bengaluru, Karnataka 560103
Map: https://share.google/1za6kRsnu6SbStkha
How to Register? https://docs.google.com/forms/d/1tQECsMnaWWFCYsznH611SkpOezTQHdEY-4KqMBKkJDA/
-----------------------------------------------------------------------------------
Overview:
Apache Spark and Kubernetes are becoming the foundation of modern cloud-native data platforms. Kubernetes makes it easier to deploy and scale Spark clusters, but running Spark workloads efficiently on it still requires careful tuning, good observability, and deliberate architectural decisions.
In this open learning session, engineers from Onehouse will share practical lessons from several years of working with some of the largest data lake deployments built on Apache Spark and open table formats, spanning large-scale lakehouse workloads and Spark pipelines across a variety of production environments.
We’ll explore topics such as:
- How Kubernetes changes the way Spark clusters are deployed, scaled, and managed
- Techniques to improve Spark SQL performance and query execution
- Optimizing Spark reads and writes across open table formats like Apache Hudi, Apache Iceberg, and Delta Lake
- Identifying compute waste and storage bottlenecks in Spark workloads
- Lessons learned from benchmarking and analyzing large-scale Spark workloads
We’ll also walk through examples of how analyzing Spark jobs with tools like the Spark History Server can surface performance issues and yield actionable optimization insights.
Expect architecture discussions, real-world performance benchmarks, and practical demos, along with an open discussion on how to run Spark workloads more effectively in modern cloud environments.
