Hybrid Models as First-Class Citizens in vLLM
Introduction and Agenda

Large language models are now running into the scaling limits of attention. Even with highly optimized implementations, KV cache memory grows linearly with sequence length, and prefill…

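The linear growth of KV cache memory can be made concrete with a back-of-the-envelope sketch. The model shape below is an assumed Llama-7B-like configuration (32 layers, 32 KV heads, head dimension 128, fp16), not a figure from the talk:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2x for keys and values; one entry per layer, per KV head, per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed Llama-7B-like shape: 32 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
print(per_token)  # 524288 bytes = 0.5 MiB of cache per generated token

# At a 32k-token context, the cache alone is already substantial:
print(kv_cache_bytes(32, 32, 128, seq_len=32768) / 2**30)  # 16.0 GiB
```

Because every term in the product is fixed except `seq_len`, the cache cost scales strictly linearly with context length, which is the pressure that motivates hybrid (attention plus state-space or linear-attention) architectures.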