Archives AI News

What Does Your Benchmark Really Measure? A Framework for Robust Inference of AI Capabilities

arXiv:2509.19590v1 (announce type: new). Abstract: Evaluations of generative models on benchmark data are now ubiquitous, and their outcomes critically shape public and scientific expectations of AI’s capabilities. Yet growing skepticism surrounds their reliability. How can we know that a reported…

September 25, 2025

Score the Steps, Not Just the Goal: VLM-Based Subgoal Evaluation for Robotic Manipulation

arXiv:2509.19524v1 (announce type: new). Abstract: Robot learning papers typically report a single binary success rate (SR), which obscures where a policy succeeds or fails along a multi-step manipulation task. We argue that subgoal-level reporting should become routine: for each trajectory,…

September 25, 2025

Cognitive Load Limits in Large Language Models: Benchmarking Multi-Hop Reasoning

arXiv:2509.19517v1 (announce type: new). Abstract: The scaling of Large Language Models (LLMs) has exposed a critical gap between their performance on static benchmarks and their fragility in dynamic, information-rich environments. While models excel at isolated tasks, the computational limits that…

September 25, 2025

Estimating the Self-Consistency of LLMs

arXiv:2509.19489v1 (announce type: new). Abstract: Systems often repeat the same prompt to large language models (LLMs) and aggregate responses to improve reliability. This short note analyzes an estimator of the self-consistency of LLMs and the tradeoffs it induces under a…

September 25, 2025

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

arXiv:2506.03135v2 (announce type: replace-cross). Abstract: Spatial reasoning is a key aspect of cognitive psychology and remains a bottleneck for current vision-language models (VLMs). While extensive research has aimed to evaluate or improve VLMs’ understanding of basic spatial relations, such as…

September 25, 2025

UserRL: Training Interactive User-Centric Agent via Reinforcement Learning

arXiv:2509.19736v1 (announce type: new). Abstract: Reinforcement learning (RL) has shown promise in training agentic models that move beyond static benchmarks to engage in dynamic, multi-turn interactions. Yet, the ultimate value of such agents lies in their ability to assist users,…

September 25, 2025

White-Basilisk: A Hybrid Model for Code Vulnerability Detection

arXiv:2507.08540v3 (announce type: replace-cross). Abstract: The proliferation of software vulnerabilities presents a significant challenge to cybersecurity, necessitating more effective detection methodologies. We introduce White-Basilisk, a novel approach to vulnerability detection that demonstrates superior performance while challenging prevailing assumptions in AI…

September 25, 2025

The Conductor and the Engine: A Path Towards Co-Designed Reasoning

arXiv:2509.19762v1 (announce type: new). Abstract: Modern LLM reasoning relies on extensive test-time computation, driven by internal model training and external agentic orchestration. However, this synergy is often inefficient, as model verbosity and poor instruction following lead to wasted compute. We…

September 25, 2025

Evaluation-Aware Reinforcement Learning

arXiv:2509.19464v1 (announce type: new). Abstract: Policy evaluation is often a prerequisite for deploying safety- and performance-critical systems. Existing evaluation approaches frequently suffer from high variance due to limited data and long-horizon tasks, or high bias due to unequal support or…

September 25, 2025

Agentic Metacognition: Designing a “Self-Aware” Low-Code Agent for Failure Prediction and Human Handoff

arXiv:2509.19783v1 (announce type: new). Abstract: The inherent non-deterministic nature of autonomous agents, particularly within low-code/no-code (LCNC) environments, presents significant reliability challenges. Agents can become trapped in unforeseen loops, generate inaccurate outputs, or encounter unrecoverable failures, leading to user frustration and…

September 25, 2025