Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR
arXiv:2603.07084v2 Announce Type: replace

Abstract: Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring reward hacking occurrence remains challenging because true task rewards are often expensive or impossible…
