Introducing the GKE standby buffer: Improve node startup times without blowing your budget

2026-06-01 07:00 GMT · aimagpro.com

Application owners and platform engineers have long faced a difficult choice: spend excessively by over-provisioning to guarantee quick startups, or minimize costs but endure slow cold starts.
We are excited to announce a solution to this compromise: Google Kubernetes Engine (GKE) standby buffers. This builds on the launch earlier this year of GKE active buffers, a native implementation of the Kubernetes CapacityBuffers API that makes it easy to provision readily available capacity for traffic spikes, delivering near-zero startup latency for new pods. However, active buffers still impose a trade-off between performance and cost. New GKE standby buffers help by maintaining a low-cost, suspended capacity buffer for your GKE clusters. With a cost overhead in the low single-digit percent, they help you achieve near-immediate scheduling for your workloads. This is useful for all kinds of workloads: general-purpose, agentic, and everything in between.

Under identical traffic loads, the cluster without standby buffers suffered severe latency spikes, with P50, P95, and P99 metrics trapped between 4 and 6 minutes. Conversely, the cluster with standby buffers maintained a P50 latency of just single-digit seconds, while its P95 and P99 metrics briefly peaked at one minute before quickly normalizing to single-digit seconds. Both setups exhibited a similar allocatable core cost, making the buffered approach far more efficient.

The problem: High costs and latency
Traditionally, autoscaling with standard Kubernetes has been effective but slow. Traffic surges or batch jobs require the cluster autoscaler to provision fresh nodes, leaving Pods in a pending state in the meantime. To avoid these delays, you have to resort to clunky workarounds like lowering your Horizontal Pod Autoscaler (HPA) thresholds or managing so-called balloon pods. These workarounds are costly:

Managing balloon pods is operationally complex, requiring manual configuration and ongoing maintenance of priority classes and resource requests to ensure they function correctly.

Lowering the HPA threshold adds empty (wasted) space that linearly scales with the size of the node pool.

Both GKE active and standby buffers allow capacity to be defined declaratively, removing the need for clunky and operationally heavy workarounds.
In addition, GKE standby buffers lower infrastructure costs by storing the node’s state to disk and suspending the node, which eliminates compute and memory charges and leaves only persistent disk and IP address costs. Combined with an active buffer, this gives you near-instant pod scheduling with performance similar to over-provisioning, but at a very affordable price.
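To make the cost argument concrete, the following Python sketch compares the hourly cost of a suspended node with that of a running one. It assumes a running node pays for compute, disk, and an IP address, while a suspended node pays only for disk and IP. All prices are hypothetical placeholders rather than GKE list prices, and the function name `standby_cost_fraction` is illustrative only; substitute figures from your own billing data.

```python
def standby_cost_fraction(compute_per_hour: float,
                          disk_per_hour: float,
                          ip_per_hour: float) -> float:
    """Return suspended-node cost as a fraction of running-node cost.

    Assumption: a running node is billed for compute, disk, and IP,
    while a suspended node is billed only for disk and IP.
    """
    running = compute_per_hour + disk_per_hour + ip_per_hour
    suspended = disk_per_hour + ip_per_hour
    return suspended / running


# Hypothetical placeholder prices in arbitrary currency units per hour.
fraction = standby_cost_fraction(compute_per_hour=1.00,
                                 disk_per_hour=0.02,
                                 ip_per_hour=0.005)
print(f"A suspended node costs {fraction:.1%} of a running node")
```

The smaller the disk and IP share of a node's bill, the closer the standby overhead gets to zero.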

Introducing GKE Capacity Buffers – the native Kubernetes way to achieve low latency pod scheduling

Active and standby buffers working together
All GKE capacity buffers operate on a principle similar to video streaming on platforms like YouTube. By proactively attempting to provision and manage available capacity ahead of impending demand (much like pre-downloading video content), GKE helps to ensure that resources are readily available when they’re needed.
With today’s launch, the two types of capacity buffers can work in harmony:

Active buffer: Cluster Autoscaler works to reserve enough capacity for a predefined number of pods on existing cluster nodes and, if needed, provisions extra nodes. Select this ready-to-use buffer to provide capacity to your most latency-sensitive workloads.

Standby buffers: Nodes are pre-provisioned and fully initialized with necessary components like Kubernetes DaemonSets, and given time to preload images, but are then suspended, while the underlying compute capacity is released to save costs. When demand spikes, these nodes resume 2-3x faster than creating a fresh node, bridging the gap between cold starts and always-on capacity.

The active buffer covers the initial spike until standby buffers resume. The system prioritizes refilling the active buffer from the standby buffer. The standby buffer handles an extended load and protects against slower node cold starts. As standby buffers refill, they initially kick into an active state for a configurable amount of time before they are suspended, providing a boost of active capacity during sustained traffic loads.
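The division of labor described above can be sketched as a simple three-tier split of a burst of pending pods. This is a deliberately simplified model and not how GKE is implemented: it assumes one pod consumes exactly one buffer unit and ignores buffer refilling during the burst. The names `Placement` and `place_pods` are hypothetical.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    immediate: int      # served by the active buffer right away
    after_resume: int   # served once suspended nodes resume
    cold_start: int     # must wait for freshly provisioned nodes


def place_pods(demand: int, active_units: int, standby_units: int) -> Placement:
    """Split a burst of pending pods across the three capacity tiers.

    Simplified model: one pod consumes one buffer unit, and buffer
    refilling during the burst is ignored.
    """
    immediate = min(demand, active_units)
    after_resume = min(demand - immediate, standby_units)
    cold_start = demand - immediate - after_resume
    return Placement(immediate, after_resume, cold_start)


print(place_pods(demand=60, active_units=5, standby_units=50))
```

In this example, 5 pods schedule immediately, 50 wait only for nodes to resume, and the remaining 5 fall back to a regular node cold start.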
Early benchmarks
In our tests, standby buffers enabled us to deliver sub-second Agent Sandbox scheduling latency at up to 90% lower cost than complete overprovisioning.

Optimized for business needs
Businesses are under constant pressure to optimize resource consumption while streamlining operations. Recognizing that organizations need smarter tools to manage sporadic and spiky workloads, we worked hard to deliver standby buffers quickly. Now, whether you’re running agents, batch jobs, CI/CD pipelines, game servers, or other spiky workloads, GKE capacity buffers allow you to dynamically balance performance and cost. You can finally define your “insurance policy” against traffic spikes without paying a high premium for it. With GKE standby buffers you can:

Circumvent cold starts: Nodes suspended by standby buffers resume 2-3x faster than provisioning fresh nodes, reducing pod scheduling latency during traffic spikes and sustained traffic load.

Enjoy lower costs: A standby buffer incurs a fraction of the cost of active capacity because the underlying VM is suspended. You pay for storage and an IP address, rather than for full compute-hours.

Gain declarative control: Replace complex balloon pod workarounds with the simple, native declarative CapacityBuffers API, explicitly stating how much headroom you need, and letting GKE handle the rest.

“Using GKE standby capacity buffers has lowered our time-to-ready from several minutes to 30 seconds at a very affordable price.” – Pedro Spagiari, Chief Architect at Unico

Get started
Ready to improve your performance and save on costs?

Start by defining a CapacityBuffer resource in your cluster to specify your target buffer size.

Try balancing between standby buffers to reduce pod scheduling latency for sustained loads, and active buffers to address immediate unpredictable capacity needs.

Let’s look at an example of how to configure buffers for a Deployment while also using custom ComputeClasses.
Basic setup
Beginning with some basic setup, create a namespace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace
```

Then, create a custom ComputeClass (optional):

```yaml
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: my-ccc
  namespace: my-namespace
spec:
  # Buffers will also be created according to these priorities
  priorities:
  - machineFamily: n4
  - machineFamily: n4d
  - machineFamily: c4
  - machineFamily: c4d
  nodePoolAutoCreation:
    enabled: true
```

Define the buffer unit size
You can use a PodTemplate as a reference for the buffer unit size. You can also create a buffer for a specific Deployment or any other object that defines the scale subresource.

```yaml
# Defines the resource requirements for one unit of buffer.
apiVersion: v1
kind: PodTemplate
metadata:
  name: my-buffer-unit-template
  namespace: my-namespace
template:
  spec:
    terminationGracePeriodSeconds: 0
    tolerations:
    # Optional: Ensures buffer pods can land on any node.
    - key: "node-role.kubernetes.io/master"
      operator: "Exists"
      effect: "NoSchedule"
    containers:
    - name: buffer-container
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          cpu: "1"
          memory: "1Gi"
        limits:
          cpu: "1"
          memory: "1Gi"
    # Optional: Using buffers with a custom ComputeClass
    # controls the properties of the nodes GKE provisions.
    nodeSelector:
      cloud.google.com/compute-class: my-ccc
```

Create buffers
Lastly, create a CapacityBuffer object that refers to the PodTemplate. Here, you create a standby buffer of 50 CPUs and 50 GiB of RAM:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityBuffer
metadata:
  name: my-standby-buffer-resource-limits
  namespace: my-namespace
  annotations:
    # Optional: Time after which buffer nodes are suspended.
    # Default is 5 minutes.
    buffer.gke.io/standby-capacity-init-time: "5m"
    # Optional: Time after which standby buffers are recreated.
    # Default is 1 day, "never" avoids refreshing.
    buffer.gke.io/standby-capacity-refresh-frequency: "1d"
spec:
  podTemplateRef:
    name: my-buffer-unit-template
  # With the 1 CPU / 1Gi buffer unit defined earlier, these limits
  # allow up to 50 standby buffer units.
  # When a standby buffer gets used, a new one gets created.
  limits:
    cpu: "50"
    memory: "50Gi"
  provisioningStrategy: "buffer.gke.io/standby-capacity"
```

And an active buffer of 5 CPUs and 5 GiB of RAM (optional):

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityBuffer
metadata:
  name: my-active-buffer-resource-limits
  namespace: my-namespace
spec:
  podTemplateRef:
    name: my-buffer-unit-template
  # With the 1 CPU / 1Gi buffer unit defined earlier, these limits
  # allow up to 5 active buffer units.
  # When an active buffer gets used, a new one gets created.
  limits:
    cpu: "5"
    memory: "5Gi"
  provisioningStrategy: "buffer.x-k8s.io/active-capacity"
```

Finally, apply the above objects to your cluster. That’s it!
Now, any existing and future deployments that can schedule on the space reserved by the buffers will benefit from faster pod scheduling latencies.
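To sanity-check how much headroom a given `limits` block buys, the following Python sketch divides the limits by the per-unit requests of the PodTemplate. It assumes the buffer is filled with whole units and that the scarcer resource decides the count; it ignores node overhead and bin-packing, so treat the result as an upper bound rather than what GKE will report. The helper names are hypothetical, and only the common Kubernetes quantity suffixes are handled.

```python
# Binary and decimal suffixes used by Kubernetes memory quantities.
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "500m" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def parse_memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "1Gi" or "500Mi" to bytes."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def buffer_units(limits: dict, unit_requests: dict) -> int:
    """Return how many whole units fit within the given limits."""
    by_cpu = (parse_cpu_millicores(limits["cpu"])
              // parse_cpu_millicores(unit_requests["cpu"]))
    by_memory = (parse_memory_bytes(limits["memory"])
                 // parse_memory_bytes(unit_requests["memory"]))
    return min(by_cpu, by_memory)


unit = {"cpu": "1", "memory": "1Gi"}  # requests from the PodTemplate
print(buffer_units({"cpu": "50", "memory": "50Gi"}, unit))  # standby limits
print(buffer_units({"cpu": "5", "memory": "5Gi"}, unit))    # active limits
```

The same function can estimate how many replicas of a smaller workload fit in a buffer by passing that workload's requests as `unit_requests`.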
Test the buffers
You can check on the status of your buffers. In Kubernetes, suspended nodes can be identified by the Suspended node condition.

```shell
kubectl get nodes -o custom-columns='NAME:.metadata.name,SUSPENDED:.status.conditions[?(@.type=="Suspended")].status'
```

Expect the following kind of output, and wait for the standby buffers to get suspended.

```
NAME                                            SUSPENDED
gke-my-cluster-nap-n4-standard-8-k960-…-ffbx    False     # Node has been resumed.
gke-my-cluster-nap-n4-standard-4-k960-…-h2x4    <none>    # Node was never suspended.
gke-my-cluster-nap-n4d-standard-8-1cip-…-74jf   True      # Node is suspended.
```
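If you would rather process this information in a script, the same check can be sketched in Python against the JSON form of the node list. The snippet below assumes input shaped like the output of `kubectl get nodes -o json` and relies only on the Suspended condition type described above. The function names and the sample node names are hypothetical.

```python
import json


def suspension_state(node: dict) -> str:
    """Return "True", "False", or "<none>" for a node's Suspended condition."""
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == "Suspended":
            return condition.get("status", "<none>")
    return "<none>"


def summarize(node_list: dict) -> dict:
    """Map each node name to its Suspended state."""
    return {
        node["metadata"]["name"]: suspension_state(node)
        for node in node_list.get("items", [])
    }


# Hand-written sample shaped like `kubectl get nodes -o json` output;
# the node names are hypothetical.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "node-a"},
   "status": {"conditions": [{"type": "Suspended", "status": "True"}]}},
  {"metadata": {"name": "node-b"},
   "status": {"conditions": [{"type": "Ready", "status": "True"}]}}
]}
""")
print(summarize(sample))
```

Against a real cluster, you could pipe `kubectl get nodes -o json` into a script that calls `summarize(json.load(sys.stdin))`.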

To test the buffers, create a deployment and scale it.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
  namespace: my-namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-deployment
  template:
    metadata:
      labels:
        app: my-deployment
    spec:
      containers:
      - name: busybox
        image: busybox
        command: ["sleep", "inf"]
        resources:
          requests:
            cpu: "500m"
            memory: "500Mi"
      # Optional: Using buffers with a custom ComputeClass
      # controls the properties of the nodes GKE provisions.
      nodeSelector:
        cloud.google.com/compute-class: my-ccc
```

Scaling this deployment to two replicas lets the new pods land on the active buffer for immediate scheduling. The active buffer is then immediately refilled from the standby buffer, and at the same time the standby buffer starts provisioning new nodes to refill itself.
If you further scale the deployment to 50 replicas, the pods are scheduled on standby buffer capacity as soon as the suspended nodes resume. New nodes provisioned to refill the standby buffer briefly act as active capacity before they are suspended, providing a temporary boost. So if you scale the deployment to 100 replicas during this window, you may notice that the new replicas benefit from immediate scheduling.
GKE standby buffer best practices
When working with GKE standby buffers, here are a few things to consider:

Define standby buffers that are sufficient to cover the extended load you expect to encounter, so that buffers can refill in the background from a cold start. A sufficiently sized standby buffer can drop your max pod scheduling latency to the time it takes to resume a node — around 30 seconds.

When the buffer starts to get used and is refilled, new buffer nodes initially swing into an active state prior to suspending. This helps to boost active capacity during a prolonged load.

If your application requires the lowest possible pod scheduling latency, define an active buffer size that is sufficient to cover any initial spikes you expect to encounter until standby buffer nodes are able to resume. The system prioritizes refilling the active buffer by consuming the standby buffer. A sufficiently sized active buffer and a sufficiently sized standby buffer can help you achieve one-second pod scheduling latency for a fraction of the cost of overprovisioning.

Experiment with different buffer sizes to get the best result for your workload.

To help you size the buffers to achieve your performance targets, we created a simulator, available at https://github.com/gke-labs/buffers-simulator.
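Before reaching for the simulator, a back-of-the-envelope estimate can give a starting point for the active buffer size. The Python sketch below assumes a constant pod arrival rate during a spike, one buffer unit per pod, and the roughly 30-second node resume time quoted above. The function name `active_units_needed` is hypothetical, and the result is a rough rule of thumb rather than a sizing guarantee.

```python
import math


def active_units_needed(pods_per_second: float,
                        resume_seconds: float = 30.0) -> int:
    """Estimate active buffer units needed to bridge the resume window.

    Assumes a constant pod arrival rate and one buffer unit per pod.
    """
    return math.ceil(pods_per_second * resume_seconds)


# Example: a spike that adds one new pod every two seconds.
print(active_units_needed(pods_per_second=0.5))
```

Pods that arrive after the resume window are then expected to land on resumed standby capacity, which is why the standby buffer should be sized for the sustained part of the load.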
Try it yourself!
Active and standby buffers in GKE provide a native solution for low-latency and cost-effective workload scaling by maintaining warm and standby capacity buffers. By circumventing slow node cold starts, buffers help performance-critical applications handle sudden traffic spikes. This feature replaces complex manual workarounds like balloon pods with a simple, declarative API, and allows for fixed, percentage-based, or resource-limited buffering strategies to help maintain strict service-level objectives cost-effectively and without over-provisioning for peak.
Standby buffers are available for GKE clusters running version 1.36.0-gke.2253000 or later. To get started with buffers, check out the documentation.