Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation
arXiv:2606.02684v1 Announce Type: new
Abstract: On-policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most informative,…
