Three Concrete Challenges and Two Hopes for the Safety of Unsupervised Elicitation
arXiv:2602.20400v1 Announce Type: new
Abstract: To steer language models towards truthful outputs on tasks that are beyond human capability, previous work has suggested training models on easy tasks to steer them on harder ones (easy-to-hard generalization), or using unsupervised training…
