Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs
arXiv:2512.17131v3 Announce Type: replace Abstract: We propose Generalized Primal Averaging (GPA), an extension of Nesterov’s method that unifies and generalizes recent averaging-based optimizers, such as single-worker DiLoCo and Schedule-Free, in a non-distributed setting. While DiLoCo relies on a memory-intensive two-loop structure…
