Murphys Laws of AI Alignment: Why the Gap Always Wins
arXiv:2509.05381v3
Abstract: We study reinforcement learning from human feedback under misspecification. Sometimes human feedback is systematically wrong on certain types of inputs, like a broken compass that points the wrong way in specific regions. We prove that…
