Archives AI News

Preference Leakage: A Contamination Problem in LLM-as-a-judge

arXiv:2502.01534v3 (Announce Type: replace). Abstract: Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little…

Knowing When to Quit: Probabilistic Early Exits for Speech Separation

arXiv:2507.09768v3 (Announce Type: replace). Abstract: In recent years, deep learning-based single-channel speech separation has improved considerably, in large part driven by increasingly compute- and parameter-efficient neural network architectures. Most such architectures are, however, designed with a fixed compute and parameter…

Circuit Insights: Towards Interpretability Beyond Activations

arXiv:2510.14936v2 (Announce Type: replace). Abstract: The fields of explainable AI and mechanistic interpretability aim to uncover the internal structure of neural networks, with circuit discovery as a central tool for understanding model computations. Existing approaches, however, rely on manual inspection…

Solving adversarial examples requires solving exponential misalignment

arXiv:2603.03507v1 (Announce Type: new). Abstract: Adversarial attacks – input perturbations imperceptible to humans that fool neural networks – remain both a persistent failure mode in machine learning and a phenomenon with mysterious origins. To shed light, we define and analyze…

SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety

arXiv:2505.20065v2 (Announce Type: replace). Abstract: As Large Language Models (LLMs) are increasingly deployed in real-world applications, balancing helpfulness and safety has become a central challenge. A natural approach is to incorporate safety constraints into Reinforcement Learning from Human Feedback (RLHF),…