Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability
arXiv:2506.07985v2
Abstract: Interpreting individual neurons or directions in activation space is an important topic in mechanistic interpretability. Numerous automated interpretability methods have been proposed to generate such explanations, but it remains unclear how reliable these explanations are,…
