Apache Druid and YuniKorn: Universal Resource Scheduler for both K8s and YARN

Agenda:
6 pm -- 6:30 pm Check-in + Networking
6:30 pm -- 7:20 pm Talk 1 (Cloudera)
7:20 pm -- 8:10 pm Talk 2 (Imply)
8:30 pm -- 9 pm Networking
9 pm -- closing

Talk 1:
YuniKorn: A Universal Resource Scheduler for both Kubernetes and YARN.

We will talk about our open source work, the YuniKorn scheduler project (Y for YARN, K for K8s, uni- for Unified). It brings long-wanted features such as hierarchical queues, fairness between users/jobs/queues, and preemption to Kubernetes, and it brings service scheduling enhancements to YARN. Any improvement to this scheduler can benefit both the Kubernetes and YARN communities.

YARN schedulers are optimized for high-throughput, multi-tenant batch workloads. They can scale up to 50k nodes per cluster and schedule 20k containers per second. Kubernetes schedulers, on the other hand, are optimized for long-running services, but many features such as hierarchical queues, fair resource sharing, and preemption are either missing or not yet mature.

Underneath, however, both are responsible for the same job: deciding how resources are allocated. We see the need to run services on YARN as well as jobs on Kubernetes. This motivated us to create a universal scheduler that works with both YARN and Kubernetes and is configured the same way on each.

The YuniKorn scheduler (Y for YARN, K for K8s, uni- for Unified) brings long-wanted features such as hierarchical queues, fairness between users/jobs/queues, and preemption to Kubernetes, and it brings service scheduling enhancements to YARN. Most importantly, it gives YARN and Kubernetes the opportunity to share the same user experience for scheduling big data workloads. Any improvement to this scheduler can benefit both the Kubernetes and YARN communities.

In this talk, we’re going to cover our efforts to design and implement the YuniKorn scheduler. We have integrated it with both YARN and Kubernetes. We will show demos and best practices.

Speakers: Wangda Tan, Suma Shivaprasad (Cloudera)

Wangda is a PMC member of Apache Hadoop and Sr. Engineering Manager of the computation platform team at Cloudera. He manages all efforts related to Kubernetes and YARN for both on-cloud and on-prem use cases at Cloudera. His primary areas of interest are the YuniKorn scheduler (scheduling containers across YARN and Kubernetes) and the Hadoop Submarine project (running deep learning workloads across YARN and Kubernetes). He has also led efforts in the Hadoop YARN community on features such as resource scheduling, GPU isolation, node labeling, and resource preemption. Previously, he worked at Pivotal on OpenMPI/GraphLab, and at Alibaba on cloud computing, large-scale machine learning, and matrix and statistics computation platforms with Map-Reduce and MPI.

Suma Shivaprasad is an Apache Hadoop Committer and a member of the Apache Atlas Project Management Committee. Suma works on the compute platform team at Cloudera, which focuses on Hadoop, YARN, Kubernetes, and enabling these platforms in the public cloud.

Talk 2: Swimming in the Data River

The dirty secret of most “streaming analytics” technologies is that they are just stream processors: they sit on a stream and continuously compute the results of a particular query. They’re good for alerting, keeping a dashboard up-to-date in real time, and streaming ETL, but they’re not good at powering apps that give you true insight into what is happening: for this you need the ability to explore, slice/dice, drill down, and search into the data. This talk will cover the current state of the streaming analytics world and what Apache Druid, a real-time analytical database, brings to the table.

Speaker: Gian (Imply)

Gian is a co-founder and CTO of Imply, a San Francisco-based technology company. Gian is also one of the main committers of Druid. Previously, Gian led the data ingestion team at Metamarkets and held senior engineering positions at Yahoo. He holds a BS in Computer Science from Caltech.