SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals
arXiv:2502.01042v5 (Announce Type: replace)

Abstract: Large language models (LLMs) exhibit exceptional capabilities across various tasks but also pose risks by generating harmful content. Existing safety mechanisms, while improving model safety, often lead to overly cautious behavior and fail to fully…
