MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting
arXiv:2601.09085v1 (Announce Type: new)
Abstract: Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of…
