What Does Your Benchmark Really Measure? A Framework for Robust Inference of AI Capabilities
arXiv:2509.19590v1 Abstract: Evaluations of generative models on benchmark data are now ubiquitous, and their outcomes critically shape public and scientific expectations of AI’s capabilities. Yet growing skepticism surrounds their reliability. How can we know that a reported…
