Archives AI News

Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing

arXiv:2602.11786v2 Announce Type: replace Abstract: Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety through breadth-oriented evaluation across diverse tasks and risk categories. However, real-world deployment often exposes a different class of risk: operational…

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

arXiv:2604.24954v1 Announce Type: new Abstract: We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements…

Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory

arXiv:2603.26554v2 Announce Type: replace Abstract: Spectral optimizers such as Muon have recently shown strong empirical performance in large-scale language model training, but the source and extent of their advantage remain poorly understood. We study this question through the linear associative…

Compute Aligned Training: Optimizing for Test Time Inference

arXiv:2604.24957v1 Announce Type: new Abstract: Scaling test-time compute has emerged as a powerful mechanism for enhancing Large Language Model (LLM) performance. However, standard post-training paradigms, Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), optimize the likelihood of individual samples under a…

CoreFlow: Low-Rank Matrix Generative Models

arXiv:2604.24959v1 Announce Type: new Abstract: Learning matrix-valued distributions from high-dimensional and possibly incomplete training data is challenging: ambient-space generative modeling is computationally expensive and statistically fragile when the matrix dimension is large but the sample size is limited. We propose…

Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

arXiv:2604.24964v1 Announce Type: new Abstract: Existing web agent benchmarks have largely converged on short, single-site tasks that frontier models are approaching saturation on. However, real world web use consists of long-horizon, multi-site workflows. Common web navigation tasks, such as comparing…