On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm

2025-10-05 19:00 GMT · aimagpro.com

arXiv:2505.11840v3 Announce Type: replace
Abstract: As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not theoretically well understood. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(x^k)\|_1\right]\leq O(\frac{\sqrt{d}C}{K^{1/4}})$ for AdamW measured by the $\ell_1$ norm, where $K$ is the iteration number, $d$ is the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $\|\nabla f(x)\|_2\ll\|\nabla f(x)\|_1\leq\sqrt{d}\,\|\nabla f(x)\|_2$ for any high-dimensional vector $x$, and $E\left[\|\nabla f(x)\|_1\right]\geq\sqrt{\frac{2d}{\pi}}\,E\left[\|\nabla f(x)\|_2\right]$ when each element of $\nabla f(x)$ is drawn from the Gaussian distribution $\mathcal{N}(0,1)$. Empirically, our experimental results on real-world deep learning tasks reveal $\|\nabla f(x)\|_1=\varTheta(\sqrt{d})\,\|\nabla f(x)\|_2$. Both results support viewing our convergence rate as analogous to the optimal $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(x^k)\|_2\right]\leq O(\frac{C}{K^{1/4}})$ convergence rate of SGD in the ideal case. We also extend our result to NAdamW, an AdamW variant that employs a double-momentum mechanism, and demonstrate that it maintains the same convergence rate.
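The Gaussian norm relation in the abstract is easy to sanity-check numerically. The sketch below (an illustration, not from the paper; dimension and trial count are arbitrary choices) draws vectors with i.i.d. $\mathcal{N}(0,1)$ entries and compares the average ratio $\|g\|_1/\|g\|_2$ against the predicted scale $\sqrt{2d/\pi}$:

```python
import numpy as np

# Hypothetical check of the claim E[||g||_1] ~ sqrt(2d/pi) * E[||g||_2]
# for g with i.i.d. standard Gaussian entries.
rng = np.random.default_rng(0)
d = 10_000      # model dimension (arbitrary, chosen large)
trials = 200    # number of sampled gradient-like vectors

g = rng.standard_normal((trials, d))
mean_l1 = np.abs(g).sum(axis=1).mean()            # average l1 norm
mean_l2 = np.linalg.norm(g, axis=1).mean()        # average l2 norm

ratio = mean_l1 / mean_l2
predicted = np.sqrt(2 * d / np.pi)                # sqrt(2d/pi)
print(f"empirical ratio = {ratio:.3f}, predicted sqrt(2d/pi) = {predicted:.3f}")
```

For large $d$ the two values agree closely, illustrating why the $\ell_1$-norm rate with its extra $\sqrt{d}$ factor is comparable to the classical $\ell_2$-norm rate of SGD.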