arXiv:2604.21016v1 Announce Type: new
Abstract: When training neural networks with full-batch gradient descent (GD) and step size $\eta$, the largest eigenvalue of the Hessian — the sharpness $S(\boldsymbol{\theta})$ — rises to $2/\eta$ and hovers there, a phenomenon termed the Edge of Stability (EoS). \citet{damian2023selfstab} showed that this behavior is explained by a self-stabilization mechanism driven by the third-order structure of the loss, and that GD implicitly follows projected gradient descent (PGD) on the constraint $S(\boldsymbol{\theta}) \leq 2/\eta$. For mini-batch stochastic gradient descent (SGD), the sharpness stabilizes below $2/\eta$, with the gap widening as the batch size decreases; yet no theoretical explanation exists for this suppression.
We introduce stochastic self-stabilization, extending the self-stabilization framework to SGD. Our key insight is that gradient noise injects variance into the oscillatory dynamics along the top Hessian eigenvector, strengthening the cubic sharpness-reducing force and shifting the equilibrium below $2/\eta$. Following the approach of \citet{damian2023selfstab}, we define stochastic predicted dynamics relative to a moving projected gradient descent trajectory and prove a stochastic coupling theorem that bounds the deviation of SGD from these predictions. We derive a closed-form equilibrium sharpness gap: $\Delta S = \eta \beta \sigma_{\boldsymbol{u}}^{2}/(4\alpha)$, where $\alpha$ is the progressive sharpening rate, $\beta$ is the self-stabilization strength, and $\sigma_{\boldsymbol{u}}^{2}$ is the gradient noise variance projected onto the top eigenvector. This formula predicts that smaller batch sizes yield flatter solutions and recovers GD when the batch equals the full dataset.
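As a quick illustration of the closed-form gap $\Delta S = \eta \beta \sigma_{\boldsymbol{u}}^{2}/(4\alpha)$ and its batch-size dependence, the sketch below evaluates the formula for a range of batch sizes. All numerical constants ($\eta$, $\alpha$, $\beta$, the per-sample noise variance, and the $1/B - 1/n$ noise scaling for sampling without replacement) are hypothetical values chosen for illustration, not taken from the paper.

```python
def sharpness_gap(eta, alpha, beta, sigma_u_sq):
    """Predicted equilibrium gap Delta S = eta * beta * sigma_u^2 / (4 * alpha)."""
    return eta * beta * sigma_u_sq / (4.0 * alpha)

# Hypothetical constants for illustration only (not values from the paper).
eta = 0.01        # step size, so the EoS threshold 2/eta is 200
alpha = 0.5       # progressive sharpening rate
beta = 2.0        # self-stabilization strength
n_full = 1024     # full dataset size
sigma_one = 4e3   # per-sample noise variance along the top eigenvector

for batch in (16, 64, 256, 1024):
    # Assumed noise scaling for sampling without replacement: variance
    # shrinks like 1/batch - 1/n_full, vanishing at full batch, so the
    # GD limit Delta S = 0 is recovered when batch == n_full.
    sigma_u_sq = sigma_one * (1.0 / batch - 1.0 / n_full)
    s_eq = 2.0 / eta - sharpness_gap(eta, alpha, beta, sigma_u_sq)
    print(f"batch={batch:5d}  predicted equilibrium sharpness = {s_eq:.4f}")
```

Under these assumptions the predicted equilibrium sharpness increases monotonically with batch size and reaches exactly $2/\eta$ at full batch, matching the abstract's claim that smaller batches yield flatter solutions.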
