ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training
arXiv:2406.02613v3 Announce Type: replace Abstract: Training LLMs relies on distributed implementations using multiple GPUs to compute gradients in parallel with sharded optimizers. However, synchronizing gradients in data parallel setups introduces communication overhead that grows with the number of workers, limiting…
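
To make the overhead concrete, below is a minimal sketch (not the paper's ACCO algorithm) of the general "accumulate while you communicate" idea the title names: launching a non-blocking gradient all-reduce for one micro-batch while the backward pass of the next micro-batch runs. It assumes PyTorch with `torch.distributed` already initialized, a model where every parameter receives a gradient, and hypothetical helper names (`train_step`, `micro_batches`, `loss_fn`).

```python
# Illustrative sketch only, assuming torch.distributed is initialized.
import torch
import torch.distributed as dist

def train_step(model, optimizer, micro_batches, loss_fn):
    comm_handles = []   # pending async all-reduce work handles
    comm_buffers = []   # gradient snapshots currently in flight

    for x, y in micro_batches:
        loss = loss_fn(model(x), y)
        loss.backward()  # local gradients for this micro-batch

        # Snapshot gradients and launch non-blocking all-reduce, then clear
        # .grad so the next backward pass overlaps with the communication.
        bufs = [p.grad.detach().clone() for p in model.parameters()]
        handles = [dist.all_reduce(b, op=dist.ReduceOp.SUM, async_op=True)
                   for b in bufs]
        comm_handles.append(handles)
        comm_buffers.append(bufs)
        for p in model.parameters():
            p.grad = None

    # Drain in-flight reductions and accumulate the averaged gradients.
    world = dist.get_world_size()
    for handles, bufs in zip(comm_handles, comm_buffers):
        for h in handles:
            h.wait()
        for p, b in zip(model.parameters(), bufs):
            b /= world
            p.grad = b if p.grad is None else p.grad + b

    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```

In this sketch the synchronization cost is hidden behind compute only to the extent that a micro-batch's backward pass takes longer than the all-reduce; how ACCO itself schedules accumulation, communication, and the sharded optimizer step is described in the full paper.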
