Foundations of Distributed Training: How Modern AI Systems Are Built
Details
We are excited to welcome Suman Debnath, Technical Lead in Machine Learning at Anyscale, for a practical and intuitive introduction to distributed training.
Talk Description:
As modern AI models continue to grow, single-GPU training is no longer enough. Distributed training has become essential, but scaling models introduces challenges that require understanding communication patterns, system bottlenecks, and key trade-offs.
In this session, we will break down distributed training from first principles. We will explore why single-GPU training hits its limits, how transformer models manage memory, and what techniques like gradient accumulation, activation checkpointing, and data parallelism actually do.
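To make one of those ideas concrete, here is a minimal gradient-accumulation sketch in PyTorch; the model, synthetic data, and hyperparameters are placeholders for illustration rather than material from the talk. Gradients from several micro-batches are summed in place before a single optimizer step, which mimics a larger batch size without the extra memory.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder model, loss, and optimizer (illustrative sizes only).
    model = torch.nn.Linear(512, 10)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    # Synthetic data purely for illustration.
    data = TensorDataset(torch.randn(64, 512), torch.randint(0, 10, (64,)))
    loader = DataLoader(data, batch_size=8)

    accum_steps = 4  # micro-batches per optimizer step (an assumed choice)
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        loss = loss_fn(model(x), y) / accum_steps  # scale so summed grads match one big batch
        loss.backward()                            # gradients accumulate in the .grad buffers
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()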
We will also demystify communication primitives, walk through ZeRO-1, ZeRO-2, and ZeRO-3 as well as FSDP, and show how compute and communication can be overlapped for greater efficiency. Finally, we will connect these concepts to real-world tooling used in frameworks like Ray and PyTorch. Attendees will gain a clear, grounded understanding of how distributed training works and when to apply different strategies.
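For a taste of that tooling, the sketch below wraps a toy model in PyTorch's FullyShardedDataParallel, whose default full-sharding mode follows the ZeRO-3 idea of sharding parameters, gradients, and optimizer state across ranks. It assumes a multi-GPU host launched with torchrun; the layer sizes and batch are placeholders.

    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    def main():
        # torchrun sets the rank and world-size environment variables for us.
        dist.init_process_group("nccl")
        torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

        # Toy model with illustrative sizes.
        model = torch.nn.Sequential(
            torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
        ).cuda()
        model = FSDP(model)  # each rank now stores only a shard of the parameters

        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
        x = torch.randn(8, 1024, device="cuda")  # placeholder batch
        loss = model(x).sum()
        loss.backward()      # gradients are reduce-scattered across ranks
        optimizer.step()     # each rank updates only its own shard
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Under full sharding, parameters are gathered just in time for each layer's forward and backward pass and released afterwards, which is where overlapping compute with communication pays off.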
Bio:
Suman Debnath is a Technical Lead in Machine Learning at Anyscale, where he works on large-scale distributed training, fine-tuning, and inference optimization in the cloud. His expertise spans Natural Language Processing, Large Language Models, and Retrieval-Augmented Generation.
He has also spoken at more than one hundred global conferences and events, including PyCon, PyData, and ODSC, and has previously built performance benchmarking tools for distributed storage systems.
We look forward to seeing you!
#DataScience #MachineLearning #DistributedTraining #Ray #PyTorch #LLM #RAG #DeepLearning #USFCA #USFMSDSAI #DataInstitute #AIEngineering #TechTalk
