arXiv:2506.00286v2 Announce Type: replace-cross
Abstract: In this paper, we analyze the sample complexities of learning the optimal state-action value function $Q^*$ and an optimal policy $\pi^*$ in a finite discounted Markov decision process (MDP) where the agent has recursive entropic risk preferences with risk parameter $\beta \neq 0$ and where a generative model of the MDP is available. We provide and analyze a simple model-based approach, which we call model-based risk-sensitive $Q$-value-iteration (MB-RS-QVI), which leads to $(\varepsilon,\delta)$-PAC bounds on $\|Q^*-Q_k\|$ and $\|V^*-V^{\pi_k}\|$, where $Q_k$ is the output of MB-RS-QVI after $k$ iterations and $\pi_k$ is the greedy policy with respect to $Q_k$. Both PAC bounds have exponential dependence on the effective horizon $\frac{1}{1-\gamma}$, and the strength of this dependence grows with the learner's risk-sensitivity $|\beta|$. We also provide two lower bounds which show that exponential dependence on $|\beta|\frac{1}{1-\gamma}$ is unavoidable in both cases. The lower bounds reveal that the PAC bounds are tight in the parameters $S$, $A$, $\delta$, $\varepsilon$, and that, unlike in the classical setting, it is not possible to have polynomial dependence on all model parameters.
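The abstract does not spell out the update rule, but a minimal sketch of what a procedure like MB-RS-QVI could look like is given below, assuming the standard entropic (exponential-utility) Bellman backup applied to an empirical transition model built from generative-model samples. The function name, array shapes, the sign convention for $\beta$, and the construction of the empirical model are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mb_rs_qvi(P_hat, r, gamma, beta, num_iters):
    """Sketch of model-based risk-sensitive Q-value iteration.

    P_hat : (S, A, S) empirical transition probabilities, assumed to be
            estimated from samples per (s, a) drawn via the generative model.
    r     : (S, A) reward matrix.
    gamma : discount factor in (0, 1).
    beta  : risk parameter, beta != 0 (entropic / exponential-utility risk).
    """
    S, A = r.shape
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        V = Q.max(axis=1)  # V_k(s') = max_{a'} Q_k(s', a')
        # Entropic Bellman backup (certainty equivalent of the next-state value):
        #   Q_{k+1}(s,a) = r(s,a) + (gamma/beta) * log E_{s' ~ P_hat(.|s,a)}[exp(beta * V_k(s'))]
        # This recovers the risk-neutral backup in the limit beta -> 0.
        # For large |beta * V| a log-sum-exp trick would be needed for stability.
        cert_equiv = np.log(P_hat @ np.exp(beta * V)) / beta
        Q = r + gamma * cert_equiv
    pi_k = Q.argmax(axis=1)  # greedy policy with respect to Q_k
    return Q, pi_k
```

The certainty-equivalent form of the backup makes the exponential horizon dependence discussed in the abstract plausible: errors in $V$ enter through $\exp(\beta V)$, so value perturbations can be amplified by factors exponential in $|\beta|\frac{1}{1-\gamma}$ as they propagate through the recursion.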
