ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents Training
arXiv:2511.20718v1 Announce Type: new
Abstract: PPO has been widely adopted for training large language models (LLMs) at the token level in multi-turn dialogue and reasoning tasks. However, its performance is often unstable and prone to collapse. Through empirical analysis, we…
