On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
arXiv:2505.17508v4

Abstract: Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet the design surface, choice of KL direction (forward vs. reverse), normalization (normalized vs.…
