Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models
arXiv:2603.05773v2 Announce Type: replace-cross Abstract: Safety alignment is often conceptualized as a monolithic process wherein harmfulness detection automatically triggers refusal. However, the persistence of jailbreak attacks suggests a fundamental mechanistic decoupling. We propose the Disentangled Safety Hypothesis (DSH), positing that…
