Geometry-Guided Adversarial Prompt Detection via Curvature and Local Intrinsic Dimension
arXiv:2503.03502v2 Abstract: Adversarial prompts can jailbreak frontier large language models (LLMs) and induce undesirable behaviours, posing a significant obstacle to their safe deployment. Current mitigation strategies primarily rely on activating built-in defence mechanisms or fine-tuning…
