Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry
arXiv:2503.01822v2 Announce Type: replace Abstract: Sparse Autoencoders (SAEs) are widely used to interpret neural networks by identifying meaningful concepts from their representations. However, do SAEs truly uncover all concepts a model relies on, or are they inherently biased toward certain…
