Stable Reasoning, Unstable Responses: Mitigating LLM Deception via Stability Asymmetry
arXiv:2603.26846v1 Announce Type: new Abstract: As Large Language Models (LLMs) grow in capability and application scope, their trustworthiness becomes critical. A key risk is intrinsic deception, wherein models strategically mislead users to pursue their own objectives. Existing alignment approaches based…
