SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs
arXiv:2511.07572v1 Announce Type: new Abstract: Mechanistic interpretability aims to decompose neural networks into interpretable features and map their connecting circuits. The standard approach trains sparse autoencoders (SAEs) on each layer’s activations. However, SAEs trained in isolation don’t encourage sparse cross-layer…
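To make the "standard approach" named in the abstract concrete, here is a minimal sketch of one independent sparse autoencoder per layer, written in plain Python. Everything specific in it is an assumption for illustration and is not taken from the paper: the ReLU encoder, the L1 sparsity penalty, the coefficient `l1_coeff`, the helper names (`matvec`, `init_sae`, `sae_forward`, `sae_loss`), and the random stand-in activations. It shows only the forward pass and the loss, not a training loop.

```python
import random

def matvec(x, W, b):
    """Return x @ W + b for a vector x, a matrix W (len(x) rows), and a bias b."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def init_sae(d_model, n_features, rng):
    """Randomly initialise one SAE (hypothetical helper, small Gaussian weights)."""
    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "W_enc": mat(d_model, n_features), "b_enc": [0.0] * n_features,
        "W_dec": mat(n_features, d_model), "b_dec": [0.0] * d_model,
    }

def sae_forward(x, params):
    """Encode activations into non-negative features, then reconstruct them."""
    pre = matvec(x, params["W_enc"], params["b_enc"])
    f = [max(0.0, v) for v in pre]            # ReLU feature activations (assumed)
    x_hat = matvec(f, params["W_dec"], params["b_dec"])
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty on the features (assumed form)."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in f)
    return recon + l1_coeff * sparsity

rng = random.Random(0)
d_model, n_features = 4, 16

# Stand-in activations for two layers; a real setup would record these from a model.
layer_acts = {layer: [rng.gauss(0.0, 1.0) for _ in range(d_model)] for layer in (0, 1)}

# One SAE per layer, each evaluated in isolation: nothing in this loss
# couples the features of layer 0 to the features of layer 1.
saes = {layer: init_sae(d_model, n_features, rng) for layer in layer_acts}
losses = {}
for layer, x in layer_acts.items():
    f, x_hat = sae_forward(x, saes[layer])
    losses[layer] = sae_loss(x, f, x_hat)
```

Because each loss term sees only its own layer's activations, the sketch has no objective that rewards sparse connections between the feature sets of different layers, which is the limitation the abstract points to.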
