Archives AI News

Jailbreaking LLMs Without Gradients or Priors: Effective and Transferable Attacks

arXiv:2601.03420v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly deployed in safety-critical domains, rigorously evaluating their robustness against adversarial jailbreaks is essential. However, current safety evaluations often overestimate robustness because existing automated attacks are limited by restrictive…

Compact Example-Based Explanations for Language Models

arXiv:2601.03786v1 Announce Type: cross Abstract: Training data influence estimation methods quantify the contribution of training documents to a model’s output, making them a promising source of information for example-based explanations. As humans cannot interpret thousands of documents, only a small…

Spectral Archaeology: The Causal Topology of Model Evolution

arXiv:2601.03424v1 Announce Type: new Abstract: Behavioral benchmarks tell us \textit{what} a model does, but not \textit{how}. We introduce a training-free mechanistic probe using attention-graph spectra. Treating each layer as a token graph, we compute algebraic connectivity ($\lambda_2$), smoothness, and spectral…

Current Agents Fail to Leverage World Model as Tool for Foresight

arXiv:2601.03905v1 Announce Type: cross Abstract: Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee…