Big Data on Kubernetes


Updated Location! Please note that the map provided is incorrect. The proper location of the Ericsson house is shown here:

Kubernetes has gradually become the standard enterprise runtime fabric. It has changed the IT landscape dramatically and unified and simplified the deployment model for applications in the cloud, on-prem or on federated clusters. There is now a growing interest in running Big Data workloads on Kubernetes - the “cloud native” way. Big data relies on several projects and services, such as YARN for scheduling and Zookeeper for consistency. While these were great tools for on-prem environments, they have failed to keep pace with the requirements and technologies of dynamic cloud environments.

This meetup is jointly organized by the Kubernetes and Cloud Native Computing meetup and the Budapest Big Data meetup.

Talks and speakers:

1. Benefits of big data workloads on Kubernetes
A deep dive into the changes in the stack, the benefits and the future plans, comparing them with the latest changes introduced in Hadoop 3.0 - for all to conclude that these are too little, too late and technically inferior. The session will cover the problem and the solution the community chose, introduce attendees to architectural details and sample use cases of big data workloads on k8s, and showcase the many benefits of running them on Kubernetes. We believe that by the end of the session attendees will have a better understanding of the integration and the out-of-the-box advantages brought by k8s - where cloud agnosticism, monitoring, services, discovery and tracing are first-class citizens of the runtime fabric.

Speaker: Janos Matyas

2. Spark and Zeppelin on Kubernetes

Spark is becoming the standard for data processing, but it was originally designed mostly for static environments. At the same time, applications deployed to Kubernetes are environment agnostic and run on-premise or in the cloud the same way.

This session dives deep into the internals of the Spark on k8s integration - from scheduling (using the native k8s scheduler) and monitoring (Prometheus) to autoscaling (alerts) and data locality. We demonstrate the technical benefits of running Spark on Kubernetes and the operational benefits of avoiding workload "islands". We also walk through the anatomy of a Spark application on k8s and explain the flow and role of each component: Driver, Executor, External Shuffle Service and Resource Staging Server. We also showcase running Zeppelin notebooks natively on Kubernetes. At the end of the session, we demo a canary Spark app release using Istio.

Speakers: Sebastian Toader, Sandor Magyari

3. Kafka on Kubernetes
In this talk we discuss deploying, running and operating Apache Kafka on top of Kubernetes.
tl;dr: we have removed Zookeeper and use etcd instead.

Kafka is a popular streaming system, and it is frequently deployed to k8s. It uses Zookeeper internally, so before we can use a Kafka cluster we need to set up, deploy and operate a ZK cluster. This implies unnecessary operational overhead, introduces complexity and adds an external point of failure. The features provided by ZK are already available in Kubernetes, provided by etcd. We have removed all the ZK dependencies from the Kafka codebase, and now all quotas, controller election, cluster membership and configuration operations are dispatched to etcd. Better and quicker reaction to broker failures, workload reschedules/rebalances and monitoring are inherited from Kubernetes and make Kafka easier to scale and operate.

Speakers: Sebastian Toader, Balint Molnar

This is an English-speaking event. Venue and catering provided by Ericsson R&D.