arXiv:2603.05773v2 Announce Type: replace-cross
Abstract: Safety alignment is often conceptualized as a monolithic process wherein harmfulness detection automatically triggers refusal. However, the persistence of jailbreak attacks suggests a fundamental mechanistic decoupling. We propose the Disentangled Safety Hypothesis (DSH), positing that safety computation operates on two distinct subspaces: a Recognition Axis ($\mathbf{v}_H$, “Knowing”) and an Execution Axis ($\mathbf{v}_R$, “Acting”). Our geometric analysis reveals a universal “Reflex-to-Dissociation” evolution, where these signals transition from antagonistic entanglement in early layers to structural independence in deep layers. To validate this, we introduce Double-Difference Extraction and Adaptive Causal Steering. Using our curated AmbiguityBench, we demonstrate a causal double dissociation, effectively creating a state of “Knowing without Acting.” Crucially, we leverage this disentanglement to propose the Refusal Erasure Attack (REA), which achieves state-of-the-art attack success rates by surgically lobotomizing the refusal mechanism. Furthermore, we uncover a critical architectural divergence, contrasting the Explicit Semantic Control of Llama3.1 with the Latent Distributed Control of Qwen2.5. The code and dataset are available at https://anonymous.4open.science/r/DSH.
