Experimenting with TPUs, GKE Managed DRANET, and Multi-cluster Inference Gateway

2026-06-01 22:00 GMT · aimagpro.com

What happens when your workload fails in one region but you still need the service to be reachable? This is a common availability and uptime concern. With recent enhancements to the Kubernetes ecosystem, including Dynamic Resource Allocation (DRA) and Inference Gateway, I decided to experiment with these capabilities in Google Cloud using a simple AI inference workload.
In this blog, we will explore this setup. You can also jump straight to the detailed configs in the codelab Build multi-cluster GKE Inference Gateway, with TPUs, Cloud Storage FUSE and managed DRANET.
Building blocks
To build out this experiment, use the following products, features, and tools:

Google Kubernetes Engine (GKE) managed DRANET: A managed feature that lets you request and share resources among Pods. It supports GPUs and TPUs. In this test, TPUs were used in two different regions, with networking assigned through managed DRANET.

Multi-cluster GKE Inference Gateway: Load balances your AI/ML inference workloads across multiple GKE clusters. It also handles failover, which is what my experiment set out to test. The GatewayClass that supports this is the multi-cluster, cross-region internal Application Load Balancer: gke-l7-cross-regional-internal-managed-mc.

Cloud Storage FUSE: Provides a way to store data, models, checkpoints, and logs directly in Cloud Storage. To speed up deployment, an open Gemma model was downloaded to this storage ahead of time for retrieval.

Virtual Private Cloud (VPC): The foundational global network providing isolated, secure communication for the internal load balancers and compute nodes.

GKE Fleets: Fleets group the separate regional clusters under a unified management control plane.

TPU v6e: Google’s custom AI accelerators that provide the high-performance compute required to serve the model. The machine type used was ct6e-standard-4t in a 2×2 slice.

Design pattern example
The aim is to deploy an LLM (Gemma 3) onto two GKE clusters in different regions. Each cluster uses 4 TPU v6e chips, and the model is stored in Cloud Storage. The workload is served through GKE Inference Gateway, which supports multiple clusters. Traffic should be routed to the region closest to the user and fail over to the other region if one region fails.

Putting it together

To get access to TPUs in two regions, make sure your project has the necessary quota in both of those regions.

 

Begin: Set up the environment. 

Create a standard VPC with firewall rules and a subnet in the same region as the reservation.

Create a proxy-only subnet. This is used by the internal Application Load Balancer attached to the GKE Inference Gateway.

Set up firewall rules allowing traffic and health checks.

Reserve static internal IP addresses in both regions for the Gateway.

Provision a Cloud Storage FUSE bucket and configure a dedicated IAM Service Account. Bind this to a Kubernetes Workload Identity so your pods can securely mount the bucket and read the model weights directly.
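On the Kubernetes side, the Workload Identity link is usually expressed as a ServiceAccount annotated with the IAM service account's email. The manifest below is a minimal sketch of that step, not the codelab's exact config: the name gemma-ksa and the GSA_NAME and PROJECT_ID placeholders are assumptions for illustration.

```yaml
# Kubernetes ServiceAccount that pods use to mount the bucket.
# gemma-ksa, GSA_NAME and PROJECT_ID are hypothetical placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gemma-ksa
  namespace: default
  annotations:
    # Links this ServiceAccount to the IAM service account via Workload Identity.
    iam.gke.io/gcp-service-account: GSA_NAME@PROJECT_ID.iam.gserviceaccount.com
```

The annotation alone is not enough. The IAM service account also needs a roles/iam.workloadIdentityUser binding for this Kubernetes ServiceAccount, plus read access to the bucket, and both are granted outside of Kubernetes.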

Next: Create standard GKE clusters and node pools.

Deploy two separate GKE clusters, one in each of your chosen regions.

Enable the Gateway API (--gateway-api=standard) and the Cloud Storage FUSE CSI driver (--addons GcsFuseCsiDriver) during cluster creation.

Create dedicated TPU v6e node pools (ct6e-standard-4t) for both clusters.

Enable managed DRANET on these TPU node pools by setting the flags --accelerator-network-profile=auto and --node-labels=cloud.google.com/gke-networking-dra-driver=true.

Next: Establish the global mesh via Fleet Registration.

Register both GKE clusters to a unified GKE Fleet by following the fleet creation and registration setup.

Enable Multi-Cluster Service Discovery and Multi-Cluster Ingress on your fleet.

Designate your primary region as the configuration hub to act as the control plane for routing rules across both regions.

Next: Deploy the AI workload.

Use a temporary Kubernetes Job to download the Gemma 3 (gemma-3-27b-it) model weights directly into your Cloud Storage bucket.
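A download Job along these lines is one way to do this. Treat it as a sketch under stated assumptions rather than the codelab's manifest: the Job name, the gemma-ksa ServiceAccount, the hf-secret Secret, the BUCKET_NAME placeholder, the python:3.11-slim image, and the /data mount path are all hypothetical. It relies on the huggingface_hub library's snapshot_download function and on the HF_TOKEN environment variable, which that library reads for gated models such as Gemma.

```yaml
# One-off Job that copies the model weights into a Cloud Storage FUSE mount.
# All names and the image below are illustrative assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: gemma-download
  namespace: default
spec:
  backoffLimit: 2
  template:
    metadata:
      annotations:
        gke-gcsfuse/volumes: "true"   # enables the Cloud Storage FUSE sidecar
    spec:
      restartPolicy: Never
      serviceAccountName: gemma-ksa   # hypothetical, bound via Workload Identity
      containers:
      - name: downloader
        image: python:3.11-slim       # assumption: any image with Python and pip
        command: ["/bin/sh", "-c"]
        args:
        - |
          set -e
          pip install --quiet huggingface_hub
          python -c 'from huggingface_hub import snapshot_download; snapshot_download(repo_id="google/gemma-3-27b-it", local_dir="/data/gemma-3-27b-it")'
        env:
        - name: HF_TOKEN              # access token for the gated model
          valueFrom:
            secretKeyRef:
              name: hf-secret         # hypothetical Secret
              key: token
        volumeMounts:
        - name: model-bucket
          mountPath: /data
      volumes:
      - name: model-bucket
        csi:
          driver: gcsfuse.csi.storage.gke.io
          volumeAttributes:
            bucketName: BUCKET_NAME   # placeholder for your bucket
```

Because the bucket is shared, the Job only needs to run once from one cluster, and it can be deleted after it completes.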

Define a ResourceClaimTemplate that explicitly requests the managed DRANET device class (deviceClassName: netdev.google.com) with the allocation mode set to "All".

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: all-netdev
  namespace: default
spec:
  spec:
    devices:
      requests:
      - name: req-netdev
        exactly:
          # Managed DRANET device class
          deviceClassName: netdev.google.com
          # Claim every matching network device on the node
          allocationMode: All
```

Deploy your inference server (e.g. vLLM) on the TPU nodes in both regions. Ensure the pod spec utilizes node selectors for the 2×2 TPU topology, requests exactly 4 TPUs, and mounts the netdev claim. This guarantees your pods utilize the dedicated accelerator networking alongside standard Ethernet.
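The pod spec described above might look roughly like the following Deployment. This is a sketch, not the codelab's manifest. Taken from the configs in this post are the all-netdev claim template, the app: gemma-server label, and port 8000. Everything else is an assumption for illustration: the VLLM_TPU_IMAGE and BUCKET_NAME placeholders, the gemma-ksa ServiceAccount, the /models mount path, and the vLLM arguments, which assume the image's entrypoint is the vLLM OpenAI-compatible server.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma-server
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server               # matched by the AutoscalingMetric selector
      annotations:
        gke-gcsfuse/volumes: "true"     # enables the Cloud Storage FUSE sidecar
    spec:
      serviceAccountName: gemma-ksa     # hypothetical, bound via Workload Identity
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
        cloud.google.com/gke-tpu-topology: 2x2
      resourceClaims:
      - name: netdev
        resourceClaimTemplateName: all-netdev   # template defined earlier
      containers:
      - name: vllm
        image: VLLM_TPU_IMAGE           # placeholder: a vLLM image built for TPU
        args:                           # assumes a vLLM server entrypoint
        - --model=/models/gemma-3-27b-it
        - --tensor-parallel-size=4
        - --port=8000
        ports:
        - containerPort: 8000
        resources:
          requests:
            google.com/tpu: 4
          limits:
            google.com/tpu: 4
          claims:
          - name: netdev                # attaches the DRANET claim to this container
        volumeMounts:
        - name: model
          mountPath: /models
          readOnly: true
      volumes:
      - name: model
        csi:
          driver: gcsfuse.csi.storage.gke.io
          readOnly: true
          volumeAttributes:
            bucketName: BUCKET_NAME     # placeholder for your bucket
```

The same manifest would be applied to both clusters, so that each region runs an identical server reading the same model files.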

Next: Configure the Multi-Cluster Inference Gateway.

Install the necessary Custom Resource Definitions (CRDs) so Kubernetes can process specialized routing objects like the InferenceObjective.

Deploy an AutoscalingMetric to track hardware utilization, such as KV cache usage.

Use Helm to group the independent AI deployments from both regions into a single, logical InferencePool.

Deploy the Cross-Region Gateway and its associated HTTPRoute to manage incoming global traffic.

Apply health checks and backend policies to the pool to ensure load balancing relies on your custom hardware metrics.

Configure an InferenceObjective to instruct the gateway to route prompts to the region with the highest availability, avoiding overloaded TPUs.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: cross-region-gateway
  namespace: default
spec:
  gatewayClassName: gke-l7-cross-regional-internal-managed-mc
  addresses:
  # One reserved internal IP per region
  - type: networking.gke.io/named-address-with-region
    value: "regions/europe-west4/addresses/gemma-gateway-ip-europe-west4"
  - type: networking.gke.io/named-address-with-region
    value: "regions/us-east5/addresses/gemma-gateway-ip-us-east5"
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: gemma-route
  namespace: default
spec:
  parentRefs:
  - name: cross-region-gateway
    kind: Gateway
  rules:
  - backendRefs:
    - group: networking.gke.io
      kind: GCPInferencePoolImport
      name: gemma-pool
      port: 8000
---
apiVersion: networking.gke.io/v1
kind: HealthCheckPolicy
metadata:
  name: gemma-health-check
  namespace: default
spec:
  targetRef:
    group: networking.gke.io
    kind: GCPInferencePoolImport
    name: gemma-pool
  default:
    config:
      type: HTTP
      httpHealthCheck:
        requestPath: /health
        port: 8000
---
apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: gemma-backend-policy
  namespace: default
spec:
  targetRef:
    group: networking.gke.io
    kind: GCPInferencePoolImport
    name: gemma-pool
  default:
    timeoutSec: 100
    # Balance on the exported KV cache metric
    balancingMode: CUSTOM_METRICS
    trafficDuration: LONG
    customMetrics:
    - name: gke.named_metrics.tpu-cache
      dryRun: false
      maxUtilizationPercent: 60
---
apiVersion: autoscaling.gke.io/v1beta1
kind: AutoscalingMetric
metadata:
  name: tpu-cache
  namespace: default
spec:
  selector:
    matchLabels:
      app: gemma-server
  endpoints:
  - port: 8000
    path: /metrics
    metrics:
    # vLLM KV cache usage, exported as tpu-cache
    - name: vllm:kv_cache_usage_perc
      exportName: tpu-cache
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: gemma-objective
  namespace: default
spec:
  priority: 10
  poolRef:
    name: gemma-pool
    group: "inference.networking.k8s.io"
```

Testing the Failover

Verify the highly available architecture by simulating a primary region outage. Once the primary deployment is taken offline, the Gateway automatically detects the failure and seamlessly reroutes all subsequent user requests to the active secondary cluster, ensuring continuous availability without dropping traffic.

 
Next Steps

For a deeper dive and more information on these features, review the following:

Hands-on Codelab: Build multi-cluster GKE Inference Gateway, with TPUs, Cloud Storage FUSE and managed DRANET

Document set: DRANET

Documentation: AI Hypercomputer

Want to ask a question, find out more, or share a thought? Please connect with me on LinkedIn.