Token-Regulated Group Relative Policy Optimization for Stable Reinforcement Learning in Large Language Models
arXiv:2511.00066v1 Announce Type: new

Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful approach for strengthening the reasoning capabilities of large language models (LLMs). Among existing algorithms, Group Relative Policy Optimization (GRPO) has demonstrated strong performance, yet…
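The abstract names GRPO without detailing its mechanics. As background, a minimal sketch of the group-relative advantage computation GRPO is commonly described with (per-prompt groups of sampled completions, rewards normalized by the group mean and standard deviation); the function name and epsilon are illustrative, not from this paper:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each completion's reward against its group's mean and
    std -- the group-relative baseline GRPO is named for. `rewards` holds
    the scores of all completions sampled for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four completions for one prompt, scored by a binary
# verifiable reward (1.0 = correct, 0.0 = incorrect).
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline is computed within each group rather than by a learned value model, correct completions get positive advantages and incorrect ones negative, with zero mean across the group.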
