Benchmarking Hindi LLMs: A New Suite of Datasets and a Comparative Analysis

arXiv:2508.19831v2 Announce Type: replace-cross Abstract: Evaluating instruction-tuned Large Language Models (LLMs) in Hindi is challenging due to a lack of high-quality benchmarks, as direct translation of English datasets fails to capture crucial linguistic and cultural nuances. To address this, we…

Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check

arXiv:2510.12981v1 Announce Type: new Abstract: Current unlearning metrics for generative models evaluate success based on reference responses or classifier outputs rather than assessing the core objective: whether the unlearned model behaves indistinguishably from a model that never saw the unwanted…

Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math

arXiv:2510.13744v1 Announce Type: cross Abstract: Large language model (LLM)-based reasoning systems have recently achieved gold medal-level performance in the IMO 2025 competition, writing mathematical proofs where, to receive full credit, each step must be not only correct but also sufficiently…

Can DPO Learn Diverse Human Values? A Theoretical Scaling Law

arXiv:2408.03459v5 Announce Type: replace Abstract: Large language models (LLMs) have demonstrated remarkable capabilities but often struggle to align with human preferences, leading to harmful or undesirable outputs. Preference learning, which trains models to distinguish between preferred and non-preferred responses based…

Max It or Miss It: Benchmarking LLM On Solving Extremal Problems

arXiv:2510.12997v1 Announce Type: new Abstract: Test-time scaling has enabled Large Language Models (LLMs) with remarkable reasoning capabilities, particularly in mathematical domains, through intermediate chain-of-thought (CoT) reasoning before generating final answers. However, the specific sources and mechanisms underlying these reasoning capabilities…

AMORE: Adaptive Multi-Output Operator Network for Stiff Chemical Kinetics

arXiv:2510.12999v1 Announce Type: new Abstract: Time integration of stiff systems is a primary source of computational cost in combustion, hypersonics, and other reactive transport systems. This stiffness can introduce time scales significantly smaller than those associated with other physical processes,…

A Brain-to-Population Graph Learning Framework for Diagnosing Brain Disorders

arXiv:2506.16096v2 Announce Type: replace Abstract: Recently developed graph-based methods for diagnosing brain disorders using functional connectivity rely heavily on predefined brain atlases, but overlook the rich information embedded within atlases and the confounding effects of site and phenotype variability. To…

Escaping Local Optima in the Waddington Landscape: A Multi-Stage TRPO-PPO Approach for Single-Cell Perturbation Analysis

arXiv:2510.13018v1 Announce Type: new Abstract: Modeling cellular responses to genetic and chemical perturbations remains a central challenge in single-cell biology. Existing data-driven frameworks have advanced perturbation prediction through variational autoencoders, chemically conditioned autoencoders, and large-scale transformer pretraining. However, these models…