The case for and against fixed step-size: Stochastic approximation algorithms in optimization and machine learning

arXiv:2309.02944v3 Announce Type: replace-cross Abstract: Theory and application of stochastic approximation (SA) have become increasingly relevant due in part to applications in optimization and reinforcement learning. This paper takes a new look at SA with constant step-size $\alpha>0$, defined by the recursion $$\theta_{n+1} = \theta_n + \alpha f(\theta_n,\Phi_{n+1})$$ in which $\theta_n\in\mathbb{R}^d$ and $\{\Phi_n\}$ is a Markov chain. The goal is to approximately solve the root-finding problem $\bar{f}(\theta^*)=0$, where $\bar{f}(\theta)=\mathbb{E}[f(\theta,\Phi)]$ and $\Phi$ has the steady-state distribution of $\{\Phi_n\}$. The following conclusions are obtained under an ergodicity assumption on the Markov chain, compatible assumptions on $f$, and for $\alpha>0$ sufficiently small: $\textbf{1.}$ The pair process $\{(\theta_n,\Phi_n)\}$ is geometrically ergodic in a topological sense. $\textbf{2.}$ For every $1\le p\le 4$, there is a constant $b_p$ such that $\limsup_{n\to\infty}\mathbb{E}[\|\theta_n-\theta^*\|^p]\le b_p \alpha^{p/2}$ for each initial condition. $\textbf{3.}$ The Polyak-Ruppert-style averaged estimates $\theta^{\text{PR}}_n=n^{-1}\sum_{k=1}^{n}\theta_k$ converge to a limit $\theta^{\text{PR}}_\infty$ almost surely and in mean square, which satisfies $\theta^{\text{PR}}_\infty=\theta^*+\alpha\bar{\Upsilon}^*+O(\alpha^2)$ for an identified non-random $\bar{\Upsilon}^*\in\mathbb{R}^d$. Moreover, the covariance is approximately optimal: the limiting covariance matrix of $\theta^{\text{PR}}_n$ is approximately minimal in a matricial sense. The two main takeaways for practitioners are application-dependent. It is argued that, in applications to optimization, constant-gain algorithms may be preferable even when the objective has multiple local minima, while a vanishing-gain algorithm is preferable in applications to reinforcement learning due to the presence of bias.
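To make the constant step-size recursion and the Polyak-Ruppert average concrete, here is a minimal sketch, not taken from the paper: it assumes a hypothetical two-state Markov chain for $\{\Phi_n\}$ and the toy choice $f(\theta,\phi)=\phi-\theta$, so that $\bar{f}(\theta)=\mathbb{E}[\Phi]-\theta$ and the root is $\theta^*=\mathbb{E}[\Phi]$ under the stationary distribution.

```python
# Minimal sketch of constant step-size stochastic approximation with
# Markovian noise and Polyak-Ruppert averaging. Toy problem (not from the
# paper): Phi is a two-state Markov chain and f(theta, phi) = phi - theta,
# so bar_f(theta) = E[Phi] - theta and theta* = E[Phi].
import numpy as np

rng = np.random.default_rng(0)

# Two-state Markov chain on {0.0, 2.0} with transition matrix P.
states = np.array([0.0, 2.0])
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
# Stationary distribution solves pi P = pi, giving pi = (2/3, 1/3),
# hence theta* = E[Phi] = (1/3) * 2.0 = 2/3.
theta_star = 2.0 / 3.0

def f(theta, phi):
    # f(theta, phi) = phi - theta  =>  bar_f(theta) = E[Phi] - theta.
    return phi - theta

def sa_constant_step(alpha, n_steps):
    theta = 0.0   # theta_0
    s = 0         # index of the current Markov chain state
    pr_sum = 0.0  # running sum for the Polyak-Ruppert average
    for _ in range(n_steps):
        s = rng.choice(2, p=P[s])                    # draw Phi_{n+1}
        theta = theta + alpha * f(theta, states[s])  # SA recursion
        pr_sum += theta
    return theta, pr_sum / n_steps

for alpha in (0.1, 0.01):
    theta_n, theta_pr = sa_constant_step(alpha, n_steps=200_000)
    # Conclusion 2: the raw iterate fluctuates around theta* at scale
    # sqrt(alpha); halving alpha by 10x shrinks the error by about sqrt(10).
    # Conclusion 3: the PR average concentrates much closer to theta*.
    print(f"alpha={alpha}: last-iterate error={abs(theta_n - theta_star):.4f}, "
          f"PR-average error={abs(theta_pr - theta_star):.4f}")
```

For this linear toy the bias vector $\bar{\Upsilon}^*$ happens to vanish, so the averaged estimate lands essentially on $\theta^*$; in general, per conclusion 3, the averaged limit carries an $O(\alpha)$ bias, which is the source of the paper's caution about constant gain in reinforcement learning.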

2025-09-04 04:00 GMT · arxiv.org

Original: https://arxiv.org/abs/2309.02944