arXiv:2502.13822v2 Announce Type: replace Abstract: We establish novel and general high-dimensional concentration inequalities and Berry-Esseen bounds for vector-valued martingales induced by Markov chains. We apply these results to analyze the performance of the Temporal Difference (TD) learning algorithm with linear function approximation, a widely used method for policy evaluation in Reinforcement Learning (RL), obtaining a sharp high-probability consistency guarantee that matches the asymptotic variance up to logarithmic factors. Furthermore, we establish an $O(T^{-\frac{1}{4}}\log T)$ distributional convergence rate for the Gaussian approximation of the TD estimator, measured in convex distance. Our martingale bounds are broadly applicable, and our analysis of TD learning provides new insights into statistical inference for RL algorithms, bridging gaps between classical stochastic approximation theory and modern RL applications.
Original: https://arxiv.org/abs/2502.13822
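
For readers unfamiliar with the algorithm the abstract analyzes, the following is a minimal sketch of TD(0) with linear function approximation on a toy Markov reward process, together with the Polyak-averaged iterate that is the usual target of such statistical analyses. The environment, feature map, noise model, and step-size schedule here are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Markov reward process (illustrative, not from the paper):
# a fixed policy is folded into the transition matrix P.
n_states, d = 5, 3
P = rng.dirichlet(np.ones(n_states), size=n_states)   # row-stochastic transitions
r = rng.normal(size=n_states)                         # expected reward per state
Phi = rng.normal(size=(n_states, d))                  # feature map phi(s) = Phi[s]
gamma = 0.9

def td0_linear(T, alpha=0.1):
    """TD(0) with linear value approximation V_theta(s) = phi(s)^T theta.

    Returns the Polyak-averaged iterate theta_bar.
    """
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    s = 0
    for t in range(1, T + 1):
        s_next = rng.choice(n_states, p=P[s])
        reward = r[s] + 0.1 * rng.normal()             # noisy observed reward
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = reward + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        theta += alpha / np.sqrt(t) * delta * Phi[s]   # stochastic-approximation step
        theta_bar += (theta - theta_bar) / t           # running Polyak average
        s = s_next
    return theta_bar

theta_hat = td0_linear(T=50_000)
print("estimated value of state 0:", Phi[0] @ theta_hat)
```

The Markovian (non-i.i.d.) sampling of `s_next` is what makes the update sequence a Markov-chain-induced martingale-type process, the setting the abstract's concentration and Berry-Esseen results address.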
