Training Deep Neural Networks on Distributed GPUs


Details
🚀 Welcome to the first PyData Cyprus meetup of 2021 🚀
In this talk, Dr. Nikolaos Bakas will show us how to scale the training of deep neural networks across multiple GPUs for computer vision tasks, with a focus on data parallelism.
We would really like this to be a physical event so that we can have a beer and pizza together, but that will have to wait a bit longer. For now, we are sticking with virtual events.
This meetup is organized jointly with the Computation-based Science and Technology Research Center (CaSToRC) of The Cyprus Institute.
Abstract
In this talk, the training of deep neural networks on distributed GPUs will be presented, using PyTorch and Horovod. The optimization algorithms employed when training a deep network can be parallelized along two main routes. The first is data parallelism, in which the batch of samples used in each iteration is split into a number of smaller mini-batches that are processed in parallel, depending on the number of available resources (GPUs). Alternatively, we may use model parallelism, partitioning the deep learning model itself across the distributed GPUs.

For multi-GPU training we will use Horovod, a library developed at Uber and used in NVIDIA training events. In particular, with Horovod we can take a single-GPU training script and efficiently scale it to run across many GPUs in parallel. Using MPI commands, we can initialize the processes and obtain the MPI rank of each one in a straightforward manner, with fewer code changes than other solutions require. Ultimately, Horovod scripts can run on a single GPU, multiple GPUs, or even multiple hosts without any further code changes. We will present experiments on the Cyclone supercomputer of The Cyprus Institute, using PyTorch for computer vision tasks, highlighting the efficiency of data parallelism as well as the scaling capabilities compared with standard machine learning platforms such as Kaggle and Google Colab.
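To give a flavour of the pattern the abstract describes, here is a minimal sketch of data-parallel training with Horovod and PyTorch. It assumes both libraries are installed; the model, dataset, and hyperparameters are illustrative placeholders, not the code from the talk.

    # Minimal sketch: data-parallel PyTorch training with Horovod.
    # The model and dataset are hypothetical stand-ins for a real vision task.
    import torch
    import torch.nn as nn
    import horovod.torch as hvd

    hvd.init()  # initialize MPI; each process learns its rank
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())  # pin one GPU per process
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Toy stand-ins (hypothetical) for a vision model and dataset.
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
    dataset = torch.utils.data.TensorDataset(
        torch.randn(1024, 1, 28, 28), torch.randint(0, 10, (1024,))
    )

    # Data parallelism: each worker processes a distinct shard of the batch.
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset, num_replicas=hvd.size(), rank=hvd.rank()
    )
    loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

    # Scale the learning rate by the number of workers (a common heuristic).
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # Wrap the optimizer so gradients are averaged across workers (allreduce).
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )

    # Start all workers from identical weights and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
        if hvd.rank() == 0:  # report from one process only
            print(f"epoch {epoch}: loss {loss.item():.4f}")

Launched with, for example, horovodrun -np 4 python train.py, the same script runs unchanged on a single GPU (-np 1), on several GPUs, or across multiple hosts (e.g. horovodrun -np 4 -H server1:2,server2:2 python train.py), which is the portability the abstract highlights.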
About the speaker
Dr. Nikolaos Bakas is an Associate Research Scientist at the Computation-based Science and Technology Research Center (CaSToRC). He holds a Ph.D. in Engineering Optimization from the National Technical University of Athens and has worked on machine learning algorithms and numerical methods across a wide range of problems in scientific research and industrial applications.