Subliminal Signals in Preference Labels
arXiv:2603.01204v2 Announce Type: replace
Abstract: As AI systems approach superhuman capabilities, scalable oversight increasingly relies on LLM-as-a-judge frameworks where models evaluate and guide each other’s training. A core assumption is that binary preference labels provide only semantic supervision about response…
