
Scale up and down: Spark on K8s, elastic distributed deep learning & AI Engine

Hosted By
T.J. B.

Details

We are excited to have Google, IBM, and Alibaba discuss scaling Spark up and down on K8s, deep learning, and analytics. The venue is sponsored by Bolt.io.

Agenda:
6:00 pm -- 6:30 pm check-in
6:30 pm -- 6:35 pm introduction
6:35 pm -- 7:10 pm Talk 1 (Google) + QA
7:10 pm -- 7:45 pm Talk 2 (IBM) + QA
7:45 pm -- 8:20 pm Talk 3 (Alibaba) + QA

Talk 1: Improving Apache Spark Downscaling
As more workloads move to serverless-like environments, properly handling downscaling becomes increasingly important. While recomputing the entire RDD makes sense for dealing with machine failure, if your nodes are being removed frequently you can end up in a loop-like scenario: you scale down, need to recompute the expensive part of your computation, scale back up, and then need to scale back down again. Even if you aren't in a serverless-like environment, preemptible or spot instances can run into similar issues when large numbers of workers disappear at once, potentially triggering large recomputes. In this talk, we explore approaches for improving the scale-down experience on open source cluster managers such as YARN and Kubernetes: everything from how to schedule jobs to the location of blocks and their impact (shuffle and otherwise).
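
The abstract does not spell out the specific mechanisms the speakers will cover, so purely as a hedged illustration of the problem space, the PySpark sketch below enables dynamic allocation with shuffle tracking and block decommissioning, settings available in recent open-source Spark releases that aim to keep scale-down from discarding shuffle or cached data and triggering large recomputes. The app name and executor bounds are made up for the example.

  # Hedged sketch: standard open-source Spark settings, not necessarily the
  # approach discussed in the talk. Shuffle tracking and decommissioning try
  # to keep scale-down from throwing away blocks that would force recomputes.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("downscaling-demo")  # hypothetical app name
      # Let the cluster manager (e.g. Kubernetes or YARN) add and remove executors.
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.maxExecutors", "50")
      # Keep executors that hold live shuffle data around, so removing others
      # does not discard shuffle blocks and retrigger the expensive stages.
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      # Migrate cached RDD and shuffle blocks off executors being removed
      # (e.g. preempted spot/preemptible nodes) instead of losing them.
      .config("spark.decommission.enabled", "true")
      .config("spark.storage.decommission.enabled", "true")
      .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
      .config("spark.storage.decommission.rddBlocks.enabled", "true")
      .getOrCreate()
  )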

Speakers: Ben Sidhom, Karthik Palaniappan

Ben and Karthik are software engineers at Google. Both of them work on Cloud Dataproc, focusing on the scaling experience.

Talk 2: Elastic Distributed Deep Learning Training at Large Scale in On-Prem and Cloud Production

In this talk, we would like to introduce a new high-performance distributed deep learning training engine. It transforms static, monolithic training into a dynamic, resilient process: it automatically scales GPU allocation up and down while transparently training models developed in popular frameworks such as TensorFlow, PyTorch, and Caffe.

  • The architecture and design of dynamic distributed runtime management
  • Transparent dynamic model scaling (up and down). Transparency means no code changes, or a minimal change (one line), to models developed in TensorFlow or PyTorch (see the sketch after this list).
  • Experiences with large-scale on-prem and cloud deployments under different QoS policies
  • Hyperparameters
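
The engine's own API is not shown in this abstract, so as an analogy only, here is a hedged sketch of what minimal-change elastic training looks like with open-source PyTorch: the script is launched with torchrun's elastic mode so the worker count can change between restarts, and the single DDP wrapper line is the only model change. This is not IBM's engine; the file name and model are made up.

  # Analogy sketch using open-source PyTorch elastic launch, not the engine
  # described in the talk. Launch with:
  #   torchrun --nnodes=1:4 --nproc_per_node=2 elastic_train.py
  import os
  import torch
  import torch.distributed as dist
  from torch.nn.parallel import DistributedDataParallel as DDP

  def main():
      dist.init_process_group(backend="nccl")   # assumes GPUs; use "gloo" on CPU
      local_rank = int(os.environ["LOCAL_RANK"])
      torch.cuda.set_device(local_rank)
      device = torch.device(f"cuda:{local_rank}")

      model = torch.nn.Linear(128, 10).to(device)
      model = DDP(model, device_ids=[local_rank])  # the "one line" that distributes the model
      optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

      for step in range(100):
          x = torch.randn(32, 128, device=device)
          y = torch.randint(0, 10, (32,), device=device)
          loss = torch.nn.functional.cross_entropy(model(x), y)
          optimizer.zero_grad()
          loss.backward()                        # gradients all-reduced across current workers
          optimizer.step()

      dist.destroy_process_group()

  if __name__ == "__main__":
      main()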

Speaker: Yonggang Hu

Yonggang is an IBM Distinguished Engineer and Chief Architect at Spectrum Computing, IBM Systems. He has been working on distributed computing, HPC, grid, cloud, and big data analytics for the past 20 years. He is currently focusing on AI runtimes at IBM and is responsible for the roadmap and strategy of IBM Watson Machine Learning Accelerator.

Talk 3: Building an AI engine for time series data analytics

Alibaba's TSDB is a time series database that serves as the backbone service for hosting time series data, enabling high-concurrency storage and low-latency queries. TSDB's AI engine provides intelligent, advanced analysis capabilities and end-to-end business intelligence solutions, empowering companies across various industries to better understand data trends, discover anomalies, manage risks, and boost efficiency. So far, the company has scaled the service to thousands of physical nodes and delivered peak performance of 80 million operations per second.

Jian Chang will outline the design of the AI engine built on Alibaba's TSDB service, which enables fast and complex analytics of large-scale time series data across many business domains. Along the way, he will highlight solutions to the major technical challenges in data storage, processing, feature engineering, and machine learning algorithm design.
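
As a toy illustration only (not Alibaba's engine, whose actual design is covered in the talk), the hedged sketch below shows the flavor of rolling-window feature engineering and simple anomaly scoring on a time series; the data, window size, and threshold are all made up.

  # Toy sketch: rolling-window features plus a z-score style anomaly flag on
  # synthetic data. Purely illustrative; unrelated to TSDB's actual engine.
  import numpy as np
  import pandas as pd

  # Synthetic metric sampled once per minute, with one injected spike.
  idx = pd.date_range("2019-01-01", periods=1440, freq="min")
  values = np.random.normal(loc=100.0, scale=5.0, size=len(idx))
  values[720] += 60.0  # the anomaly
  series = pd.Series(values, index=idx)

  # Rolling mean/std as simple features, then a z-score per point.
  window = 60
  zscore = (series - series.rolling(window).mean()) / series.rolling(window).std()

  # Flag points far outside the rolling band.
  print(series[zscore.abs() > 4])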

Speaker: Jian Chang
Jian Chang is a senior algorithm expert at the Alibaba Group, and a data science expert and software system architect with expertise in machine learning and big data systems, along with deep domain knowledge.

SF Big Analytics
724 Brannan St · San Francisco, CA