Archives AI News

Beyond Completion: Probing Cumulative State Tracking to Predict LLM Agent Performance

arXiv:2603.27343v1 Announce Type: new Abstract: Task-completion rate is the standard proxy for LLM agent capability, but models with identical completion scores can differ substantially in their ability to track intermediate state. We introduce Working Memory Fidelity-Active Manipulation (WMF-AM), a calibrated…

March 31, 2026

RAD-LAD: Rule and Language Grounded Autonomous Driving in Real-Time

arXiv:2603.28522v1 Announce Type: cross Abstract: We present LAD, a real-time language–action planner with an interruptible architecture that produces a motion plan in a single forward pass (~20 Hz) or generates textual reasoning alongside a motion plan (~10 Hz). LAD is…

March 31, 2026

LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications

arXiv:2603.27355v1 Announce Type: new Abstract: We present a readiness harness for LLM and RAG applications that turns evaluation into a deployment decision workflow. The system combines automated benchmarks, OpenTelemetry observability, and CI quality gates under a minimal API contract, then…

March 31, 2026

AMIGO: Agentic Multi-Image Grounding Oracle Benchmark

arXiv:2603.28662v1 Announce Type: cross Abstract: Agentic vision-language models increasingly act through extended interactions, but most evaluations still focus on single-image, single-turn correctness. We introduce AMIGO (Agentic Multi-Image Grounding Oracle Benchmark), a long-horizon benchmark for hidden-target identification over galleries of visually…

March 31, 2026

Defend: Automated Rebuttals for Peer Review with Minimal Author Guidance

arXiv:2603.27360v1 Announce Type: new Abstract: Rebuttal generation is a critical component of the peer review process for scientific papers, enabling authors to clarify misunderstandings, correct factual inaccuracies, and guide reviewers toward a more accurate evaluation. We observe that Large Language…

March 31, 2026

Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking

arXiv:2509.23392v3 Announce Type: replace Abstract: Large Reasoning Models (LRMs) have achieved impressive performance on challenging tasks, yet their deep reasoning often incurs substantial computational costs. To achieve efficient reasoning, existing reinforcement learning methods still struggle to construct short reasoning path…

March 31, 2026

Heterogeneous Debate Engine: Identity-Grounded Cognitive Architecture for Resilient LLM-Based Ethical Tutoring

arXiv:2603.27404v1 Announce Type: new Abstract: Large Language Models (LLMs) are being increasingly used as autonomous agents in complex reasoning tasks, opening the niche for dialectical interactions. However, Multi-Agent systems implemented with systematically unconstrained systems systematically undergo semantic drift and logical…

March 31, 2026

Offline Materials Optimization with CliqueFlowmer

arXiv:2603.06082v3 Announce Type: replace Abstract: Recent advances in deep learning inspired neural network-based approaches to computational materials discovery (CMD). A plethora of problems in this field involve finding materials that optimize a target property. Nevertheless, the increasingly popular generative modeling…

March 31, 2026

On the Relationship between Bayesian Networks and Probabilistic Structural Causal Models

arXiv:2603.27406v1 Announce Type: new Abstract: In this paper, the relationship between probabilistic graphical models, in particular Bayesian networks, and causal diagrams, also called structural causal models, is studied. Structural causal models are deterministic models, based on structural equations or functions,…

March 31, 2026

Bidirectional Multimodal Prompt Learning with Scale-Aware Training for Few-Shot Multi-Class Anomaly Detection

arXiv:2408.13516v2 Announce Type: replace-cross Abstract: Few-shot multi-class anomaly detection is crucial in real industrial settings, where only a few normal samples are available while numerous object types must be inspected. This setting is challenging as defect patterns vary widely across…

March 31, 2026