Foundations of Distributed Training: How Modern AI Systems Are Built
Details
We are excited to welcome Suman Debnath, Technical Lead in Machine Learning at Anyscale, for a practical and intuitive introduction to distributed training.
Talk Description:
As modern AI models continue to grow, single-GPU training is no longer enough. Distributed training has become essential, but scaling models introduces challenges that require understanding communication patterns, system bottlenecks, and key trade-offs.
In this session, we will break down distributed training from first principles. We will explore why single-GPU training hits its limits, how transformer models manage memory, and what techniques like gradient accumulation, activation checkpointing, and data parallelism actually do.
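To make one of those ideas concrete, here is a minimal gradient-accumulation sketch in PyTorch; the model, synthetic data, and hyperparameters are placeholders for illustration rather than material from the talk. Gradients from several micro-batches are summed in place before a single optimizer step, which mimics a larger batch size without the extra memory.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder model, loss, and optimizer (illustrative sizes only).
    model = torch.nn.Linear(512, 10)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    # Synthetic data purely for illustration.
    data = TensorDataset(torch.randn(64, 512), torch.randint(0, 10, (64,)))
    loader = DataLoader(data, batch_size=8)

    accum_steps = 4  # micro-batches per optimizer step (an assumed choice)
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        loss = loss_fn(model(x), y) / accum_steps  # scale so summed grads match one big batch
        loss.backward()                            # gradients accumulate in the .grad buffers
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()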
We will also demystify communication primitives, walk through ZeRO-1, ZeRO-2, and ZeRO-3 as well as FSDP, and show how compute and communication can be overlapped for greater efficiency. Finally, we will connect these concepts to real-world tooling used in frameworks like Ray and PyTorch. Attendees will gain a clear, grounded understanding of how distributed training works and when to apply different strategies.
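For a taste of that tooling, the sketch below wraps a toy model in PyTorch's FullyShardedDataParallel, whose default full-sharding mode follows the ZeRO-3 idea of sharding parameters, gradients, and optimizer state across ranks. It assumes a multi-GPU host launched with torchrun; the layer sizes and batch are placeholders.

    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    def main():
        # torchrun sets the rank and world-size environment variables for us.
        dist.init_process_group("nccl")
        torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

        # Toy model with illustrative sizes.
        model = torch.nn.Sequential(
            torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
        ).cuda()
        model = FSDP(model)  # each rank now stores only a shard of the parameters

        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
        x = torch.randn(8, 1024, device="cuda")  # placeholder batch
        loss = model(x).sum()
        loss.backward()      # gradients are reduce-scattered across ranks
        optimizer.step()     # each rank updates only its own shard
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Under full sharding, parameters are gathered just in time for each layer's forward and backward pass and released afterwards, which is where overlapping compute with communication pays off.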
Bio:
Suman Debnath is a Technical Lead in Machine Learning at Anyscale, where he works on large-scale distributed training, fine-tuning, and inference optimization in the cloud. His expertise spans Natural Language Processing, Large Language Models, and Retrieval-Augmented Generation.
He has also spoken at more than one hundred global conferences and events, including PyCon, PyData, and ODSC, and has previously built performance benchmarking tools for distributed storage systems.
We look forward to seeing you!
#DataScience #MachineLearning #DistributedTraining #Ray #PyTorch #LLM #RAG #DeepLearning #USFCA #USFMSDSAI #DataInstitute #AIEngineering #TechTalk
