Archives AI News

BuilderBench — A benchmark for generalist agents

arXiv:2510.06288v1 (announce type: new). Abstract: Today’s AI models learn primarily through mimicry and sharpening, so it is not surprising that they struggle to solve problems beyond the limits set by existing data. To solve novel problems, agents should acquire skills…

Auto-Prompt Ensemble for LLM Judge

arXiv:2510.06538v1 (announce type: new). Abstract: We present a novel framework that improves the reliability of LLM judges by selectively augmenting LLM with auxiliary evaluation dimensions. Existing LLM judges often miss crucial evaluation dimensions because they fail to recognize the implicit…

WebDART: Dynamic Decomposition and Re-planning for Complex Web Tasks

arXiv:2510.06587v1 (announce type: new). Abstract: Large language model (LLM) agents are becoming competent at straightforward web tasks, such as opening an item page or submitting a form, but still struggle with objectives that require long-horizon navigation, large-scale information…

Resolution scaling governs DINOv3 transfer performance in chest radiograph classification

arXiv:2510.07191v1 (announce type: cross). Abstract: Self-supervised learning (SSL) has advanced visual representation learning, but its value in chest radiography, a high-volume imaging modality with fine-grained findings, remains unclear. Meta’s DINOv3 extends earlier SSL models through Gram-anchored self-distillation. Whether these design…

Fine-Grained Emotion Recognition via In-Context Learning

arXiv:2510.06600v1 (announce type: new). Abstract: Fine-grained emotion recognition aims to identify the emotional type in queries through reasoning and decision-making processes, playing a crucial role in various systems. Recent methods use In-Context Learning (ICL), enhancing the representation of queries in…

Vibe Checker: Aligning Code Evaluation with Human Preference

arXiv:2510.07315v1 (announce type: cross). Abstract: Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check is tied to real-world human…