Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior
arXiv:2511.02022v1 Announce Type: new Abstract: Recent work has discovered that large language models can develop broadly misaligned behaviors after being fine-tuned on narrowly harmful datasets, a phenomenon known as emergent misalignment (EM). However, the fundamental mechanisms enabling such harmful generalization…
