How to Allocate, How to Learn? Dynamic Rollout Allocation and Advantage Modulation for Policy Optimization
arXiv:2602.19208v2 Announce Type: replace
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for Large Language Model (LLM) reasoning, yet current methods face key challenges in resource allocation and policy optimization dynamics: (i) uniform rollout allocation ignores gradient variance…
