Kubernetes as AI Infrastructure: Google Cloud, llm-d, and the CNCF

2026-03-24 00:00 GMT

At Google Cloud, serving the massive-scale needs of large foundation model builders and AI-native companies is at the forefront of our AI infrastructure strategy. As generative AI transitions to mission-critical production environments, these innovators require dynamic, relentlessly efficient infrastructure to overcome complex orchestration challenges and power an agentic future.
To meet this moment, we are thrilled to announce that llm-d has officially been accepted as a Cloud Native Computing Foundation (CNCF) Sandbox project. Google Cloud is proud to be a founding contributor to llm-d alongside Red Hat, IBM Research, CoreWeave, and NVIDIA, uniting around a clear, industry-defining vision: any model, any accelerator, any cloud.
This contribution underscores Google’s long-standing leadership in open-source innovation. And under the trusted stewardship of the Linux Foundation, we are helping ensure that the future of distributed AI inference is built on open standards rather than walled gardens. This gives foundation model builders the confidence to deploy their models globally without vendor lock-in, while empowering them to run the absolute best, most highly optimized implementations of these open technologies directly on Google Cloud.

Supercharging Kubernetes for inference
Kubernetes is the undisputed industry standard for orchestration. While it provides a rock-solid foundation, it wasn’t originally built for the highly stateful and dynamic demands of LLM inference. To evolve Kubernetes for this new class of workload, we launched GKE Inference Gateway, which provides native APIs that go far beyond simple load balancing. Under the hood, the gateway leverages the llm-d Endpoint Picker (EPP) for scheduling intelligence. By delegating routing decisions to llm-d, the system enforces a multi-objective policy that considers real-time KV-cache hit rates, the number of inflight requests, and instance queue depth to route each request to the optimal backend for processing.
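To make the multi-objective policy concrete, here is a minimal sketch of the idea in Python. The field names, weights, and scoring formula are illustrative assumptions, not the actual EPP implementation: the point is simply that each candidate backend is scored on cache affinity and current load, and the request goes to the highest-scoring endpoint.

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    kv_cache_hit_rate: float  # estimated fraction of the prompt prefix already cached (0..1)
    inflight_requests: int    # requests currently being processed on this instance
    queue_depth: int          # requests waiting in this instance's queue

def score(ep: Endpoint, w_cache=1.0, w_inflight=0.1, w_queue=0.2) -> float:
    # Hypothetical scoring: reward cache affinity, penalize load.
    # Weights here are placeholders; a real scheduler would tune or learn them.
    return (w_cache * ep.kv_cache_hit_rate
            - w_inflight * ep.inflight_requests
            - w_queue * ep.queue_depth)

def pick_endpoint(endpoints: list[Endpoint]) -> Endpoint:
    # Route the request to the backend with the best combined score.
    return max(endpoints, key=score)

fleet = [
    Endpoint("pod-a", kv_cache_hit_rate=0.7, inflight_requests=4, queue_depth=2),
    Endpoint("pod-b", kv_cache_hit_rate=0.2, inflight_requests=0, queue_depth=0),
]
print(pick_endpoint(fleet).name)  # the lightly loaded pod wins despite the lower hit rate
```

A pure cache-affinity router would have chosen pod-a here; folding in queue depth and inflight load is what lets the gateway absorb bursty traffic instead of piling requests onto a hot instance.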
For foundation model builders operating at massive scale, the real-world impact of this model-aware routing is transformative. Recently, our Vertex AI team validated this architecture in production, proving its ability to handle highly unpredictable traffic without relying on fragile custom schedulers. For context-heavy coding tasks using Qwen Coder, Time-to-First-Token (TTFT) latency was slashed by over 35%. When handling bursty, stochastic chat workloads using DeepSeek for research, P95 tail latency improved by 52%, effectively absorbing severe load variance. Crucially, the gateway’s routing intelligence doubled Vertex AI’s prefix cache hit rate from 35% to 70%, drastically lowering re-computation overhead and cost-per-token.

Beyond intelligent routing, orchestrating multi-node AI deployments requires bulletproof underlying primitives, which is why Google leads the development of the Kubernetes LeaderWorkerSet (LWS) API. LWS enables llm-d to orchestrate wide expert parallelism and disaggregate compute-heavy prefill and memory-heavy decode phases into independently scalable pods. With its widespread industry adoption, LWS now orchestrates a rapidly growing footprint of production AI workloads, managing massive fleets of TPUs and GPUs at global scale. Complementing this orchestration, Google recently extended vLLM natively for Cloud TPUs. Featuring a unified PyTorch and JAX backend alongside innovations like Ragged Paged Attention v3, this integration delivers up to 5x throughput gains over our first release early last year. Together, whether you are scaling on Google Cloud TPUs or NVIDIA GPUs, these advancements help ensure state-of-the-art AI serving remains a highly optimized, accelerator-agnostic capability.
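As a rough illustration of the disaggregation pattern, the config fragment below sketches what one independently scalable decode group might look like as a LeaderWorkerSet. The `apiVersion` and the `replicas`/`leaderWorkerTemplate`/`size` fields come from the LWS API; the resource name, image, and sizing are placeholder assumptions, not a production llm-d manifest. A prefill deployment would be a separate LeaderWorkerSet scaled on its own.

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llm-decode            # illustrative: decode phase only, scaled separately from prefill
spec:
  replicas: 2                 # number of multi-node decode groups
  leaderWorkerTemplate:
    size: 4                   # one leader + three workers form a single model replica
    workerTemplate:
      spec:
        containers:
        - name: inference-worker
          image: vllm/vllm-openai:latest   # placeholder serving image
          resources:
            limits:
              nvidia.com/gpu: 8
```

Because each `leaderWorkerTemplate` group is created, scheduled, and restarted as a unit, the memory-heavy decode fleet can scale out or in without touching the compute-heavy prefill fleet, which is the core of the disaggregated-serving idea.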
Building next-gen AI infrastructure together
To build the ultimate AI infrastructure, we must bridge the gap between cloud-native Kubernetes orchestration and frontier AI research. The shift to production-grade gen AI requires an engine built on trust, transparency, and deep collaboration with the AI/ML leaders pushing the boundaries of what is possible.
We are incredibly excited to partner with the Linux Foundation, the CNCF, the PyTorch Foundation, and the rest of the open-source community to build the next generation of AI infrastructure. By establishing “well-lit paths” — proven, replicable blueprints tested end-to-end under realistic load — we are ensuring that high-performance AI thrives as an open, universally accessible ecosystem that empowers innovation without boundaries.
We invite large foundation model builders, AI natives, platform engineers, and AI researchers to join us in shaping the open future of AI inference:

Explore the well-lit paths: Visit the llm-d guides to start deploying state-of-the-art inference stacks on your infrastructure today.

Learn more: Check out the official website at https://llm-d.ai/.

Contribute: Join the community on Slack and get involved in our GitHub repositories at https://github.com/llm-d/.

Join us in celebrating llm-d at the CNCF! We look forward to scaling the engine together.