reward-lens: A Mechanistic Interpretability Library for Reward Models
arXiv:2604.26130v1
Abstract: Every RLHF-trained language model is shaped by a reward model, yet the mechanistic interpretability toolkit (logit lens, direct logit attribution, activation patching, sparse autoencoders) was built for generative LLMs whose primitives all project…
