Learning Dynamics of VLM Finetuning
arXiv:2510.11978v1 Announce Type: new Abstract: Preference-based finetuning of vision–language models (VLMs) is brittle: trivially wrong negatives inject uninformative gradients that destabilize training. We recast alignment as learning-dynamics-aware optimization and introduce Cooling-Weighted DPO (CW-DPO), a two-stage recipe that explicitly models and…
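The abstract is truncated, but the named idea — down-weighting preference pairs whose negative is trivially wrong, so they stop injecting uninformative gradients — can be sketched on top of the standard DPO loss. This is a minimal illustration, not the paper's method: the function name `cw_dpo_loss`, the cooling-weight form `sigmoid(-margin / tau)`, and the parameter `tau` are all assumptions for exposition.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cw_dpo_loss(logp_chosen, logp_rejected,
                ref_chosen, ref_rejected,
                beta=0.1, tau=1.0):
    """Per-example DPO loss with a hypothetical "cooling" weight.

    The implicit-reward margin follows standard DPO:
        margin = beta * [(log pi(y_w|x) - log pi_ref(y_w|x))
                         - (log pi(y_l|x) - log pi_ref(y_l|x))]
    The cooling weight (an assumed form, not from the paper) is ~1 for
    hard pairs (margin near 0) and decays toward 0 as the negative
    becomes trivially wrong (large margin, little useful gradient).
    """
    margin = beta * ((logp_chosen - ref_chosen)
                     - (logp_rejected - ref_rejected))
    dpo = -np.log(sigmoid(margin))          # standard DPO loss term
    w = 2.0 * sigmoid(-margin / tau)        # cooling weight, w = 1 at margin 0
    return w * dpo

# A hard pair (margin 0) keeps its full loss; a pair with a trivially
# wrong negative (large margin) is cooled toward zero.
hard = cw_dpo_loss(-1.0, -1.0, -1.0, -1.0)   # margin = 0
easy = cw_dpo_loss(-1.0, -11.0, -1.0, -1.0)  # margin = 1.0
```

In this sketch the hard pair retains the usual `-log 0.5 ≈ 0.693` loss while the easy pair's contribution shrinks, which is one plausible way to keep trivially wrong negatives from dominating the gradient.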
