Archives AI News

CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D

arXiv:2511.09904v1. Announce Type: new. Abstract: AI systems are increasingly able to autonomously conduct realistic software engineering tasks, and may soon be deployed to automate machine learning (ML) R&D itself. Frontier AI systems may be deployed in safety-critical settings, including to…

November 15, 2025

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

arXiv:2511.07250v2. Announce Type: replace-cross. Abstract: The advent of Multimodal Large Language Models (MLLMs) has expanded AI capabilities to visual modalities, yet existing evaluation benchmarks remain limited to single-video understanding, overlooking the critical need for multi-video understanding in real-world scenarios (e.g.,…

November 15, 2025

Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis for Large Reasoning Models

arXiv:2511.09907v1. Announce Type: new. Abstract: Data synthesis for training large reasoning models offers a scalable alternative to limited, human-curated datasets, enabling the creation of high-quality data. However, existing approaches face several challenges: (i) indiscriminate generation that ignores the solver’s ability…

November 15, 2025

RoboBenchMart: Benchmarking Robots in Retail Environment

arXiv:2511.10276v1. Announce Type: cross. Abstract: Most existing robotic manipulation benchmarks focus on simplified tabletop scenarios, typically involving a stationary robotic arm interacting with various objects on a flat surface. To address this limitation, we introduce RoboBenchMart, a more challenging and…

November 15, 2025

OIDA-QA: A Multimodal Benchmark for Analyzing the Opioid Industry Documents Archive

arXiv:2511.09914v1. Announce Type: new. Abstract: The opioid crisis represents a significant moment in public health that reveals systemic shortcomings across regulatory systems, healthcare practices, corporate governance, and public policy. Analyzing how these interconnected systems simultaneously failed to protect public health…

November 15, 2025

Simulating Misinformation Propagation in Social Networks using Large Language Models

arXiv:2511.10384v1. Announce Type: cross. Abstract: Misinformation on social media thrives on surprise, emotion, and identity-driven reasoning, often amplified through human cognitive biases. To investigate these mechanisms, we model large language model (LLM) personas as synthetic agents that mimic user-level biases,…

November 15, 2025

Adaptive Hyperbolic Kernels: Modulated Embedding in de Branges-Rovnyak Spaces

arXiv:2511.09921v1. Announce Type: new. Abstract: Hierarchical data pervades diverse machine learning applications, including natural language processing, computer vision, and social network analysis. Hyperbolic space, characterized by its negative curvature, has demonstrated strong potential in such tasks due to its capacity…

November 15, 2025

Reasoning About Intent for Ambiguous Requests

arXiv:2511.10453v1. Announce Type: cross. Abstract: Large language models often respond to ambiguous requests by implicitly committing to one interpretation. Intent misunderstandings can frustrate users and create safety risks. To address this, we propose generating multiple interpretation-answer pairs in a single…

November 15, 2025

SPAN: Benchmarking and Improving Cross-Calendar Temporal Reasoning of Large Language Models

arXiv:2511.09993v1. Announce Type: new. Abstract: We introduce SPAN, a cross-calendar temporal reasoning benchmark, which requires LLMs to perform intra-calendar temporal reasoning and inter-calendar temporal conversion. SPAN features ten cross-calendar temporal reasoning directions, two reasoning types, and two question formats across…

November 15, 2025

Preview, Accept or Discard? A Predictive Low-Motion Interaction Paradigm

arXiv:2511.10532v1. Announce Type: cross. Abstract: Repetitive strain injury (RSI) affects roughly one in five computer users and remains largely unresolved despite decades of ergonomic mouse redesign. All such devices share a fundamental limitation: they still require fine-motor motion to operate.…

November 15, 2025