Weight Decay may matter more than muP for Learning Rate Transfer in Practice
arXiv:2510.19093v1 Announce Type: new Abstract: Transferring the optimal learning rate from small to large neural networks can enable efficient training at scales where hyperparameter tuning is otherwise prohibitively expensive. To this end, the Maximal Update Parameterization (muP) proposes a learning…
