Robust AI Evaluation through Maximal Lotteries
arXiv:2602.21297v1

Abstract: The standard way to evaluate language models on subjective tasks is through pairwise comparisons: an annotator chooses the “better” of two responses to a prompt. Leaderboards aggregate these comparisons into a single Bradley-Terry (BT) ranking,…
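
To make the aggregation step concrete, here is a minimal sketch of how pairwise preference counts can be turned into a Bradley-Terry ranking. It uses the standard minorization-maximization (MM) update for the BT likelihood. It is not taken from the paper: the function name `fit_bradley_terry`, the iteration count, and the toy win counts are all assumptions made for illustration.

```python
def fit_bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths with the standard MM update.

    wins[i][j] is the number of times model i was preferred over model j.
    Returns strengths normalized to sum to 1 (hypothetical helper, not the paper's code).
    """
    n = len(wins)
    p = [1.0 / n] * n  # start from uniform strengths
    for _ in range(iters):
        new_p = []
        for i in range(n):
            # Total number of comparisons model i won.
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Comparisons involving i, each weighted by 1 / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        p = [x / scale for x in new_p]  # renormalize; BT is scale-invariant
    return p


# Made-up toy data: three models, each pair compared 10 times.
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = fit_bradley_terry(wins)
# Leaderboard order: indices sorted from strongest to weakest.
ranking = sorted(range(len(strengths)), key=lambda i: -strengths[i])
print(ranking, [round(s, 3) for s in strengths])
```

Under the BT model, the probability that model i is preferred over model j is p_i / (p_i + p_j), so the fitted strengths compress every pairwise outcome into one score per model, and the leaderboard is simply the sort order of those scores.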
