Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection

2026-03-05 20:00 GMT · 4 months ago aimagpro.com

arXiv:2603.04427v1 Announce Type: new
Abstract: Standard transformer attention uses identical dimensionality for queries, keys, and values ($d_q = d_k = d_v = d_{\text{model}}$). Our insight is that these components serve fundamentally different roles, and this symmetry is unnecessary. Queries and keys produce scalar attention weights (selection), while values carry rich semantic representations (value transfer). We argue that selection is an inherently lower-dimensional operation than value transfer, requiring only $O(\log N)$ dimensions to distinguish among $N$ relevant patterns. We validate this hypothesis across seven experiments: (1) positional selection tasks requiring just 1 dimension per head, (2) content-based retrieval requiring $\sim\!\log_2 N$ dimensions, (3–4) WikiText-2 and WikiText-103 language modeling where $d_{\text{select}} = d_{\text{model}}/4$ incurs only 4.3% perplexity increase while reducing QK parameters by 75%, (5) post-training SVD compression of GPT-2, revealing keys to be far more compressible than queries, with lightweight QK fine-tuning recovering nearly all quality loss, (6) a 125M-parameter LLaMA model confirming identical degradation ratios across architectures, and (7) Mistral-7B (7.2B parameters), where SVD compression followed by QK fine-tuning achieves 75% key cache savings at just 2.0% residual quality cost. For existing models, SVD compression followed by QK fine-tuning (3 epochs on a small fraction of pretraining data) achieves 75% key cache savings at $<2\%$ residual quality cost. For a 7B-parameter model serving 128K context, asymmetric attention saves 25 GB of KV cache per user, enabling approximately 60% more concurrent users on the same GPU.
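
To make the asymmetry concrete, here is a minimal NumPy sketch of single-head causal attention in which queries and keys are projected to $d_{\text{select}} = d_{\text{model}}/4$ while values keep the full width, followed by a rough key-cache calculation. It is an illustration under stated assumptions, not the paper's implementation: the function names `thin_key_attention` and `kv_cache_bytes` are hypothetical, the weights are random, and the serving configuration (32 layers, width 4096, 16-bit cache entries, no grouped-query sharing of keys, 128K taken as 131072 tokens) is assumed rather than taken from the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def thin_key_attention(x, w_q, w_k, w_v):
    """Single-head causal attention with low-dimensional Q/K and full-width V.

    x:   (n, d_model) token representations
    w_q: (d_model, d_select), w_k: (d_model, d_select), w_v: (d_model, d_model)
    Returns the attention output plus the K and V tensors a cache would hold.
    """
    q = x @ w_q                      # (n, d_select)
    k = x @ w_k                      # (n, d_select): the "thin" keys
    v = x @ w_v                      # (n, d_model):  the "full" values
    scores = q @ k.T / np.sqrt(q.shape[-1])
    n = x.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # positions j > i
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v, k, v

def kv_cache_bytes(n_tokens, n_layers, d_key, d_value, bytes_per_elem=2):
    # Per token and per layer, the cache stores one key row and one value row.
    return n_tokens * n_layers * (d_key + d_value) * bytes_per_elem

rng = np.random.default_rng(0)
n, d_model = 16, 64
d_select = d_model // 4
x = rng.standard_normal((n, d_model))
w_q = rng.standard_normal((d_model, d_select)) / np.sqrt(d_model)
w_k = rng.standard_normal((d_model, d_select)) / np.sqrt(d_model)
w_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

out, k, v = thin_key_attention(x, w_q, w_k, w_v)
print(out.shape, k.shape, v.shape)

# Assumed serving setup (not from the abstract): 32 layers, width 4096, fp16.
full = kv_cache_bytes(131072, 32, d_key=4096, d_value=4096)
thin = kv_cache_bytes(131072, 32, d_key=4096 // 4, d_value=4096)
print(f"saved: {(full - thin) / 1e9:.1f} GB of {full / 1e9:.1f} GB")
```

Only the key half of the cache shrinks in this sketch, so a 75% reduction in key width removes 37.5% of the total KV cache. Under the assumed configuration the saved bytes come out in the same range as the figure quoted in the abstract; a model that already shares keys across heads would see a smaller absolute saving.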
