arXiv:2506.07040v2 Announce Type: replace-cross Abstract: We present a non-asymptotic convergence analysis of $Q$-learning and actor-critic algorithms for robust average-reward Markov Decision Processes (MDPs) under contamination, total-variation (TV) distance, and Wasserstein uncertainty sets. A key ingredient of our analysis is showing that the optimal robust $Q$ operator is a strict contraction with respect to a carefully designed semi-norm (with constant functions quotiented out). This property enables a stochastic approximation update that learns the optimal robust $Q$-function using $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We also provide an efficient routine for robust $Q$-function estimation, which in turn facilitates robust critic estimation. Building on this, we introduce an actor-critic algorithm that learns an $\epsilon$-optimal robust policy within $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We provide numerical simulations to evaluate the performance of our algorithms.
Original: https://arxiv.org/abs/2506.07040
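To make the abstract's main idea concrete, here is a minimal sketch of tabular robust $Q$-learning for an average-reward MDP under a contamination uncertainty set $\{(1-\delta)p + \delta q : q \text{ arbitrary}\}$, for which the pessimistic next-state value has the closed form $(1-\delta)\,\mathbb{E}_p[\max_{a'} Q(s',a')] + \delta \min_{s''} \max_{a'} Q(s'',a'')$. The update subtracts a reference entry $Q(s_{\mathrm{ref}}, a_{\mathrm{ref}})$ so that the iterates stay bounded, loosely mirroring the paper's semi-norm that quotients out constant functions (as in relative value iteration). The toy MDP, contamination level `delta`, step-size schedule, and reference pair are all illustrative assumptions, not the authors' exact algorithm or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP (an assumption for illustration): n_s states, n_a actions,
# random nominal transition kernel P and rewards r in [0, 1].
n_s, n_a = 5, 3
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a] is a distribution over s'
r = rng.uniform(0.0, 1.0, size=(n_s, n_a))

delta = 0.1          # contamination level of the uncertainty set (assumed)
s_ref, a_ref = 0, 0  # reference pair anchoring the free constant
Q = np.zeros((n_s, n_a))

s = 0
for t in range(1, 200_000 + 1):
    a = rng.integers(n_a)                   # uniform exploration, for simplicity
    s_next = rng.choice(n_s, p=P[s, a])     # sample a transition from the nominal kernel
    v = Q.max(axis=1)                       # greedy values max_{a'} Q(., a')
    # Pessimistic one-step target under contamination: nature shifts a delta
    # fraction of the probability mass onto the worst state. Subtracting the
    # reference entry keeps the iterates in the quotient space of functions
    # modulo constants.
    target = r[s, a] + (1 - delta) * v[s_next] + delta * v.min() - Q[s_ref, a_ref]
    alpha = 1.0 / (1.0 + 1e-3 * t)          # diminishing step size (assumed schedule)
    Q[s, a] += alpha * (target - Q[s, a])
    s = s_next

# Q[s_ref, a_ref] estimates the robust optimal average reward; the greedy
# policy argmax_a Q(s, a) estimates the robust optimal policy.
print("estimated robust average reward:", Q[s_ref, a_ref])
print("greedy policy:", Q.argmax(axis=1))
```

For the TV-distance and Wasserstein uncertainty sets treated in the paper, only the pessimistic-target line changes: the closed-form contamination term is replaced by the corresponding support-function evaluation, which generally requires the estimation routine the abstract mentions rather than a one-line formula.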
