Optimizing distributed training for LLMs


Details
Large language models (LLMs) have demonstrated remarkable success as foundational models, benefiting various downstream applications through fine-tuning. Recent studies on loss scaling have shown that larger LLMs consistently outperform their smaller counterparts. Nevertheless, training LLMs with billions of parameters poses significant challenges and requires considerable computational resources. For example, training a one trillion parameter GPT-style model on 20 trillion tokens requires a staggering 120 million exaflops of computation.
This research explores efficient distributed training strategies to extract this computation from Frontier, the world's first exascale supercomputer dedicated to open science.
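The 120-million-exaflop figure is consistent with the common rule of thumb that dense transformer training costs roughly 6 floating-point operations per parameter per token. The short Python sketch below reproduces the arithmetic; the 6 * parameters * tokens approximation is an assumption of this summary, not a detail taken from the talk.

    # Back-of-the-envelope check of the compute estimate above, assuming the
    # standard ~6 * parameters * tokens FLOP count for dense transformer training.
    params = 1e12                      # one trillion parameters
    tokens = 20e12                     # twenty trillion training tokens
    total_flops = 6 * params * tokens  # ~1.2e26 floating-point operations
    exaflops = total_flops / 1e18      # 1 exaflop = 1e18 FLOPs
    print(f"{exaflops / 1e6:.0f} million exaflops")  # -> 120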
Sajal Dash, a research scientist at Oak Ridge National Laboratory, and his team explored parallel training techniques, including tensor, pipeline, and sharded data parallelism, to train a trillion-parameter model. Through empirical analysis and hyperparameter tuning, they identified efficient training strategies for LLMs of varying sizes.
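As a rough illustration of how these techniques compose: each model replica is split across a tensor-parallel by pipeline-parallel grid of GPUs, and (sharded) data parallelism multiplies that group across the remaining devices. The sketch below shows only this bookkeeping, with illustrative degrees; the actual parallelism degrees used on Frontier are not stated in this summary.

    # Illustrative only: how tensor, pipeline, and (sharded) data parallelism
    # multiply out to a total GPU count. The specific degrees are assumptions.
    def gpus_required(tensor_parallel: int, pipeline_parallel: int, data_parallel: int) -> int:
        # One model replica spans tensor_parallel * pipeline_parallel GPUs;
        # data parallelism shards or replicates that group data_parallel times.
        return tensor_parallel * pipeline_parallel * data_parallel

    # A hypothetical layout that fills the 3072-GPU run mentioned below:
    assert gpus_required(tensor_parallel=8, pipeline_parallel=16, data_parallel=24) == 3072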
For models with 22 billion, 175 billion, and 1 trillion parameters, they achieved GPU throughputs of 38.38%, 36.14%, and 31.96% of peak, respectively. They also achieved 100% weak scaling efficiency for the 175-billion- and 1-trillion-parameter models on 1024 and 3072 MI250X GPUs, with strong scaling efficiencies of 89% and 87% for the same two models.
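For reference, here is a minimal sketch of the standard definitions behind these metrics (assumed here, since the summary does not spell them out): weak scaling holds the work per GPU fixed, strong scaling holds the total problem size fixed, and GPU throughput is achieved FLOPs as a share of hardware peak.

    # Standard-definition sketch of the metrics quoted above (assumptions of
    # this summary, not formulas taken from the talk).
    def weak_scaling_efficiency(time_per_step_base: float, time_per_step_scaled: float) -> float:
        # Work per GPU is held constant; ideally the step time does not grow.
        return time_per_step_base / time_per_step_scaled

    def strong_scaling_efficiency(time_base: float, gpus_base: int,
                                  time_scaled: float, gpus_scaled: int) -> float:
        # Total problem size is held constant; ideal speedup equals the GPU ratio.
        speedup = time_base / time_scaled
        return speedup / (gpus_scaled / gpus_base)

    def gpu_throughput_fraction(achieved_tflops_per_gpu: float, peak_tflops_per_gpu: float) -> float:
        # e.g. the 31.96%-38.38% figures are achieved FLOPs over hardware peak.
        return achieved_tflops_per_gpu / peak_tflops_per_gpu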