arXiv:2601.07326v1 Announce Type: cross
Abstract: This paper studies the AdamW-style Shampoo optimizer, an effective implementation of classical Shampoo that notably won the external tuning track of the AlgoPerf neural network training algorithm competition. Our analysis unifies one-sided and two-sided preconditioning and establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(X_k)\|_*\right]\leq O\left(\frac{\sqrt{m+n}\,C}{K^{1/4}}\right)$, measured in the nuclear norm, where $K$ is the number of iterations, $(m,n)$ is the size of the matrix parameters, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $\|\nabla f(X)\|_F\leq \|\nabla f(X)\|_*\leq \sqrt{m+n}\,\|\nabla f(X)\|_F$, which supports viewing our convergence rate as analogous to the optimal $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(X_k)\|_F\right]\leq O\left(\frac{C}{K^{1/4}}\right)$ convergence rate of SGD in the ideal case of $\|\nabla f(X)\|_*= \Theta(\sqrt{m+n})\,\|\nabla f(X)\|_F$.
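
The norm comparison used above can be checked with a short derivation from the singular values. The sketch below is a standard argument, not necessarily the one given in the paper; the symbols $G$, $\sigma_i$, and $r$ are notation assumed here for illustration and do not appear in the abstract.

```latex
% Assumed notation (not from the abstract): G = \nabla f(X) \in \mathbb{R}^{m \times n}, G \neq 0,
% with nonzero singular values \sigma_1 \ge \dots \ge \sigma_r > 0 and r = \operatorname{rank}(G).
\begin{align*}
% Lower bound: squaring the sum of nonnegative terms only adds nonnegative cross terms.
\|G\|_F &= \Big(\sum_{i=1}^{r} \sigma_i^2\Big)^{1/2}
         \le \sum_{i=1}^{r} \sigma_i = \|G\|_* , \\
% Upper bound: Cauchy-Schwarz applied to the vectors (1,\dots,1) and (\sigma_1,\dots,\sigma_r).
\|G\|_* &= \sum_{i=1}^{r} 1 \cdot \sigma_i
         \le \sqrt{r}\,\Big(\sum_{i=1}^{r} \sigma_i^2\Big)^{1/2}
         = \sqrt{r}\,\|G\|_F
         \le \sqrt{\min(m,n)}\,\|G\|_F
         \le \sqrt{m+n}\,\|G\|_F .
\end{align*}
```

Under the ideal-case assumption $\|\nabla f(X)\|_*= \Theta(\sqrt{m+n})\,\|\nabla f(X)\|_F$, dividing the nuclear-norm bound by $\Theta(\sqrt{m+n})$ gives an $O\left(\frac{C}{K^{1/4}}\right)$ bound on the averaged Frobenius norm, which is the sense in which the two rates are analogous.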
