AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance
arXiv:2508.06944v3 Announce Type: replace
Abstract: Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and…
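The title suggests replacing the sequential SFT-then-RL pipeline with a single stage in which the weight on imitation versus exploration is itself learned. The sketch below illustrates that general idea only: one training step that blends a supervised cross-entropy loss with a REINFORCE-style RL loss through a learnable coefficient. All names (`blended_update`, `balance_logit`, the batch layouts) are illustrative assumptions, and the proper meta-update of the coefficient against a held-out objective is omitted; this is not the AMFT algorithm from the paper.

```python
# Hypothetical sketch of a blended imitation/exploration update, assuming a
# HuggingFace-style causal LM whose forward call returns `.logits`.
# This is NOT the AMFT method; it only shows a weighted combination of losses.
import torch
import torch.nn.functional as F


def blended_update(model, balance_logit, demo_batch, rollout_batch, optimizer):
    """One step mixing an SFT (imitation) loss and an RL (exploration) loss.

    balance_logit : scalar torch.nn.Parameter; sigmoid(balance_logit) weights SFT.
    demo_batch    : (input_ids, labels) from expert demonstrations.
    rollout_batch : (input_ids, sampled_tokens, sequence_rewards) from rollouts.
    """
    # Imitation term: token-level cross-entropy on demonstration data.
    demo_ids, demo_labels = demo_batch
    demo_logits = model(demo_ids).logits
    sft_loss = F.cross_entropy(
        demo_logits.view(-1, demo_logits.size(-1)),
        demo_labels.view(-1),
        ignore_index=-100,
    )

    # Exploration term: REINFORCE-style surrogate on sampled rollouts.
    roll_ids, roll_actions, rewards = rollout_batch
    roll_logits = model(roll_ids).logits
    logprobs = F.log_softmax(roll_logits, dim=-1)
    action_logprobs = logprobs.gather(-1, roll_actions.unsqueeze(-1)).squeeze(-1)
    rl_loss = -(action_logprobs.sum(dim=-1) * rewards).mean()

    # Blend the two objectives with a single learnable balance coefficient.
    # In a true meta-learning setup this coefficient would be optimized
    # against a separate meta/validation objective, not the blended loss.
    alpha = torch.sigmoid(balance_logit)
    loss = alpha * sft_loss + (1.0 - alpha) * rl_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), alpha.item()
```

A usage note under the same assumptions: `balance_logit` would be created as `torch.nn.Parameter(torch.zeros(()))` and registered with its own optimizer or update rule, so the imitation-exploration balance can shift over training rather than being fixed by hand.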
