The Effect of Attention Head Count on Transformer Approximation
arXiv:2510.06662v1 Announce Type: cross Abstract: The Transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of transformers, with particular…
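To make concrete what "attention head count" means as a structural parameter, here is a minimal NumPy sketch of multi-head self-attention. Everything in it (the function name, the random stand-in weights, the dimensions) is illustrative and not taken from the paper; it only shows how the head count splits the model dimension into per-head subspaces whose outputs are concatenated back together.

```python
import numpy as np


def multi_head_attention(X, num_heads, seed=0):
    """Toy multi-head self-attention (no masking, no learned parameters).

    X: array of shape (seq_len, d_model); num_heads must divide d_model.
    The projection matrices are random stand-ins for learned weights.
    """
    rng = np.random.default_rng(seed)
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0, "head count must divide the model dimension"
    d_head = d_model // num_heads  # each head attends in a smaller subspace

    outputs = []
    for _ in range(num_heads):
        # Per-head query/key/value projections into the d_head subspace.
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv

        # Scaled dot-product attention with a row-wise softmax.
        scores = Q @ K.T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ V)

    # Concatenating the num_heads outputs restores the model dimension.
    return np.concatenate(outputs, axis=-1)
```

Varying `num_heads` (while holding `d_model` fixed) changes only how the representation space is partitioned, not the output shape, which is exactly the kind of structural knob whose effect on approximation power the abstract describes.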
