Archives AI News

Hierarchical Reasoning Model: A Critical Supplementary Material

arXiv:2510.00355v1 Announce Type: new Abstract: Transformers have demonstrated remarkable performance in natural language processing and related domains, as they largely focus on sequential, autoregressive next-token prediction tasks. Yet, they struggle in logical reasoning, not necessarily because of a fundamental limitation…

October 2, 2025

Rethinking Thinking Tokens: LLMs as Improvement Operators

arXiv:2510.01123v1 Announce Type: cross Abstract: Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which, among other things, allows them to explore solution strategies with self-checking. This results in higher accuracy, but inflates context length, token/compute cost,…

October 2, 2025

Semantic-Driven AI Agent Communications: Challenges and Solutions

arXiv:2510.00381v1 Announce Type: new Abstract: With the rapid growth of intelligent services, communication targets are shifting from humans to artificial intelligence (AI) agents, which require new paradigms to enable real-time perception, decision-making, and collaboration. Semantic communication, which conveys task-relevant meaning…

October 2, 2025

Whose Journey Matters? Investigating Identity Biases in Large Language Models (LLMs) for Travel Planning Assistance

arXiv:2410.17333v2 Announce Type: replace Abstract: As large language models (LLMs) become increasingly integral to the hospitality and tourism industry, concerns about their fairness in serving diverse identity groups persist. Grounded in social identity theory and sociotechnical systems theory, this study…

October 2, 2025

Towards Self-Evolving Benchmarks: Synthesizing Agent Trajectories via Test-Time Exploration under Validate-by-Reproduce Paradigm

arXiv:2510.00415v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) and agent system designs have empowered agents with unprecedented levels of capability. However, existing agent benchmarks are showing a trend of rapid ceiling-hitting by newly developed agents, making…

October 2, 2025

Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents

arXiv:2508.07642v2 Announce Type: replace Abstract: Vision-and-Language Navigation (VLN) poses significant challenges for agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to…

October 2, 2025

Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization

arXiv:2510.00436v1 Announce Type: new Abstract: Automated approaches to answer patient-posed health questions are rising, but selecting among systems requires reliable evaluation. The current gold standard for evaluating the free-text artificial intelligence (AI) responses, human expert review, is labor-intensive and slow, limiting scalability.…

October 2, 2025

Expandable Decision-Making States for Multi-Agent Deep Reinforcement Learning in Soccer Tactical Analysis

arXiv:2510.00480v1 Announce Type: new Abstract: Invasion team sports such as soccer produce a high-dimensional, strongly coupled state space as many players continuously interact on a shared field, challenging quantitative tactical analysis. Traditional rule-based analyses are intuitive, while modern predictive machine…

October 2, 2025

Balancing Multimodal Training Through Game-Theoretic Regularization

arXiv:2411.07335v3 Announce Type: replace-cross Abstract: Multimodal learning holds promise for richer information extraction by capturing dependencies across data sources. Yet, current training methods often underperform due to modality competition, a phenomenon where modalities contend for training resources leaving some underoptimized.…

October 2, 2025

Rethinking Reward Models for Multi-Domain Test-Time Scaling

arXiv:2510.00492v1 Announce Type: new Abstract: The reliability of large language models (LLMs) during test-time scaling is often assessed with external verifiers or reward models that distinguish correct reasoning from flawed logic. Prior work generally assumes that process reward models (PRMs),…

October 2, 2025