Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization
arXiv:2509.23371v2
Abstract: Preference optimization is crucial for aligning large language models (LLMs) with human values and intentions. A significant challenge in this process is the distribution mismatch between pre-collected offline preference data and the evolving model policy.…
