Output Supervision Can Obfuscate the Chain of Thought
arXiv:2511.11584v1 Announce Type: new Abstract: OpenAI (2025) showed that training against a chain of thought (CoT) monitor can cause obfuscated CoTs, which contain bad behavior the monitor cannot detect. They proposed to keep CoTs monitorable by training only against output…
