Autoscaling K8s GPU Workloads in Production: A Complete Guide

Scaling GPU workloads isn’t like scaling traditional applications. GPUs are expensive, scarce, and suffer from cold start problems that can kill your user experience. Whether you’re running real-time inference for a customer-facing API or batch training jobs for your research team, getting autoscaling right means the difference between a $5,000 monthly bill and a $50,000 surprise.

This guide covers everything you need to know about GPU autoscaling — from the foundational principles to the “Golden Stack” architecture pattern that leading ML teams use in production. Then we’ll explore why an increasing number of teams are moving away from self-hosted solutions entirely.

Understanding GPU Autoscaling

Why GPU Autoscaling Is Different

Traditional CPU autoscaling is relatively straightforward: monitor CPU usage, add instances when it’s high, remove them when it’s low. With GPUs, this approach fails for several reasons:

1. GPU Utilization Is a Terrible Metric

A GPU showing 100% compute utilization might have an empty request queue, or it might be desperately overloaded. Conversely, 0% utilization could mean idle capacity or simply that the model is loading weights into VRAM.

Better metrics for inference:

  • Request queue depth
  • P95/P99 latency
  • Requests per second vs. capacity

Better metrics for training:

  • Job queue length
  • Dataset processing throughput
  • Training step time

2. Cold Starts Are Measured in Minutes, Not Seconds

When you need more GPU capacity:

  • Cloud providers take 3–5 minutes to provision new GPU instances
  • Container images for ML workloads are often 5–15GB
  • Model weights need to load into VRAM (another 30–120 seconds)

Total time to serve the first request: 5–8 minutes minimum

By comparison, CPU-based services typically achieve cold starts in 1–10 seconds.

3. Cost Asymmetry Is Extreme

An AWS t3.medium CPU instance costs ~$30/month. An AWS p4d.24xlarge with 8x A100 GPUs costs ~$32,000/month. A single idle GPU can waste more money in a day than a month of CPU overcapacity.

This makes both under-provisioning (poor user experience) and over-provisioning (budget destruction) unacceptable. You need precision.

The Golden Stack Architecture

The industry consensus for production GPU autoscaling on Kubernetes revolves around three core components:

Layer 1: Metrics and Observability

You cannot autoscale what you cannot measure.

NVIDIA DCGM Exporter (Mandatory): DCGM exports GPU-specific metrics that Kubernetes’ default metrics server knows nothing about:

```
DCGM_FI_DEV_GPU_UTIL     - GPU compute utilization
DCGM_FI_DEV_FB_USED      - Framebuffer (VRAM) memory used
DCGM_FI_DEV_GPU_TEMP     - GPU temperature
DCGM_FI_DEV_POWER_USAGE  - Power consumption
```

These metrics feed into Prometheus, which then exposes them to your autoscaling controllers via the Custom Metrics API.
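As a sketch, a Prometheus scrape job for the exporter could look like the following (the job name and service-discovery labels are assumptions; match whatever names your own DCGM exporter deployment uses):

```yaml
scrape_configs:
- job_name: dcgm-exporter             # illustrative job name
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # Keep only the DCGM exporter's endpoints; adjust the regex to the
  # actual service name in your cluster.
  - source_labels: [__meta_kubernetes_service_name]
    regex: dcgm-exporter
    action: keep
```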

The Prometheus Stack

You need:

  • Prometheus server to scrape DCGM metrics
  • Prometheus Adapter to expose metrics to Kubernetes Custom Metrics API
  • Grafana for visualization (optional but highly recommended)

Reality check: This is 3–4 separate deployments with interconnected configurations. Budget 2–3 days for setup and debugging.

Pod Autoscaling (Horizontal)

Different workload types require completely different strategies.

For Inference Workloads: KEDA (Event-Driven Autoscaling)

The standard Horizontal Pod Autoscaler (HPA) polls metrics every 15–30 seconds and uses a gradual scaling algorithm. This is too slow for spiky inference traffic.

KEDA (Kubernetes Event-Driven Autoscaling) reacts to events in real-time:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    name: model-deployment
  minReplicaCount: 0    # Scale to zero during idle
  maxReplicaCount: 20
  cooldownPeriod: 300   # Wait 5 min before scaling down
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      metricName: inference_queue_depth
      query: |
        sum(rate(inference_requests_pending[1m]))
      threshold: "10"
```

Key KEDA advantages:

  • Scale to zero when no requests are pending (massive cost savings)
  • React to queue depth, not just resource utilization
  • Support for multiple trigger types (Kafka lag, RabbitMQ depth, HTTP requests)

Alternative: HPA with Custom Metrics

If you must use HPA, configure it with Prometheus Adapter:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: "65"  # Target 65% utilization
```

Warning: Setting the target too high (90%+) means new pods won’t be ready before users experience degradation. Setting it too low wastes money. 60–70% is the sweet spot for most workloads.

For Training Workloads: Job Queueing

Training jobs are typically fixed-size (e.g., “I need 8 GPUs for distributed training”). Don’t use HPA for these.

Better approaches:

  • Kueue — Kubernetes-native job queueing with fair-sharing and priority
  • Volcano — Batch scheduling system with gang scheduling (all-or-nothing)
  • Ray on Kubernetes — For complex multi-stage ML pipelines

These systems queue jobs until sufficient GPU resources are available, preventing hundreds of Pending pods from overwhelming your scheduler.
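As a sketch of the Kueue approach, a ClusterQueue caps total GPU consumption across admitted jobs; anything beyond the quota waits in the queue rather than sitting Pending (the names and quota below are illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-queue                # illustrative name
spec:
  namespaceSelector: {}          # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 16         # jobs needing more than the free quota wait
```

Jobs opt in by pointing at a LocalQueue in their namespace; gang admission means a distributed job gets all of its GPUs or none of them.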

Node Autoscaling (Infrastructure)

This is where you win or lose on cost and performance.

Option A: Karpenter (Recommended for AWS/Azure)

The standard Cluster Autoscaler requires pre-defined “node groups” (e.g., a group for T4 GPUs, another for A100s). This forces you to:

  • Predict which GPU types you’ll need
  • Create and manage separate groups for each
  • Deal with bin-packing inefficiencies

Karpenter is “groupless”. When a pod requests a GPU, Karpenter:

  1. Reads the pod’s resource requirements
  2. Calls the cloud API to find the cheapest/fastest instance type that fits
  3. Provisions exactly that instance type on-demand

Example provisioner:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-provisioner
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["g4dn.xlarge", "p3.2xlarge", "p4d.24xlarge"]
  limits:
    resources:
      nvidia.com/gpu: 100
  # Note: in v1alpha5, consolidation and ttlSecondsAfterEmpty are mutually
  # exclusive. Use ttlSecondsAfterEmpty: 300 instead if you disable consolidation.
  consolidation:
    enabled: true
```

Key benefits:

  • Automatic bin-packing to maximize GPU utilization
  • Support for Spot instances with automatic fallback
  • Consolidation: moves pods to fewer nodes to reduce waste

Cluster Autoscaler (For GKE or Other Platforms)

If Karpenter isn’t available, the Cluster Autoscaler works against pre-defined node groups. On self-managed clusters the group bounds are passed as flags to the autoscaler (on GKE, you instead enable autoscaling on the node pool itself):

```yaml
# Excerpt from a self-managed Cluster Autoscaler Deployment
containers:
- name: cluster-autoscaler
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --nodes=0:10:gpu-t4-spot        # min:max:node-group-name
  - --nodes=0:5:gpu-a100-ondemand
```

Critical: use taints and tolerations. Apply a NoSchedule taint to all GPU nodes:

```shell
kubectl taint nodes -l gpu=true nvidia.com/gpu=present:NoSchedule
```

This prevents system pods (CoreDNS, monitoring agents) from stealing slots on expensive GPU nodes.

Your GPU workload pods need matching tolerations:

```yaml
tolerations:
- key: nvidia.com/gpu
  operator: Equal
  value: present
  effect: NoSchedule
```

Advanced GPU Optimization Techniques

Multi-Instance GPU (MIG)

Modern GPUs like A100 and H100 can be partitioned into up to 7 isolated instances.

Best for: Small inference models that don’t need a full GPU

Example partitioning:

  • 1x A100 (40GB) → 7x MIG slices (1g.5gb each)
  • Each slice acts as an independent GPU from Kubernetes’ perspective

Configuration:

```shell
# Enable MIG mode on GPU 0
sudo nvidia-smi -i 0 -mig 1

# Create seven 1g.5gb GPU instances (profile ID 19 on an A100 40GB)
# plus their compute instances (-C)
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C

# Install NVIDIA GPU Operator with MIG support
helm install gpu-operator nvidia/gpu-operator \
  --set mig.strategy=single
```

Kubernetes now sees these as distinct resources:

```yaml
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1
```

Cost impact: Run 7 workloads on hardware that previously served 1 = 7x better utilization.

GPU Time-Slicing (For Older GPUs)

T4 and V100 GPUs don’t support MIG, but you can share them through the NVIDIA device plugin’s time-slicing feature (NVIDIA MPS is a separate, complementary sharing mechanism).

Configuration in GPU Operator:

```yaml
devicePlugin:
  config:
    name: time-slicing-config
    create: true
    default: any
    data:
      any: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
            - name: nvidia.com/gpu
              replicas: 4   # Allow 4 pods per GPU
```

Trade-off: Pods share GPU compute time. If all 4 pods are active simultaneously, each gets ~25% of GPU cycles. Works well for bursty inference workloads where pods are rarely all active at once.

Solving the Cold Start Problem

The 5–8 minute cold start for new GPU nodes is often the killer for user experience.

Technique 1: Overprovisioning with Balloon Pods

Run low-priority “pause” pods that reserve GPU capacity but do nothing (or use the pod to pre-pull your pre-baked image with model weights):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-overprovisioner
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-overprovisioner
  template:
    metadata:
      labels:
        app: gpu-overprovisioner
    spec:
      priorityClassName: overprovisioning  # Low priority
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.5
        resources:
          limits:
            nvidia.com/gpu: 1
```

When a real inference pod needs capacity, Kubernetes evicts the pause pod instantly (milliseconds), and your real workload schedules immediately.

Cost: You’re paying for unused GPU capacity, but the user experience improvement often justifies it for critical workloads.
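The `overprovisioning` priority class referenced by the balloon deployment has to exist; a minimal definition looks like this (the value just needs to be lower than that of any real workload):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Placeholder pods that reserve GPU capacity and yield to real workloads"
```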

Technique 2: Model Weight Pre-loading

Bad approach: Download model weights from S3/GCS on every pod startup

```python
# DON'T DO THIS
import boto3
import torch

s3 = boto3.client("s3")
s3.download_file("bucket", "model.pt", "/tmp/model.pt")

model = torch.load("/tmp/model.pt")
```

Good approach: Bake weights into container image

```dockerfile
COPY model.pt /app/model.pt
```

Best approach: Use ReadOnlyMany (ROX) volume with a caching layer:

```yaml
volumes:
- name: model-weights
  persistentVolumeClaim:
    claimName: model-weights-pvc
    readOnly: true
```

Weights are cached on each node. New pods access them immediately.

Production Checklist

Before going live with GPU autoscaling:

Metrics & Monitoring:

  • DCGM Exporter installed and scraping all GPU nodes
  • Prometheus configured with appropriate retention (7–30 days)
  • Grafana dashboards showing GPU utilization, memory, queue depth
  • Alerts configured for: GPU OOM errors, thermal throttling, pod scheduling failures
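As an illustration, a thermal alert built on the DCGM metrics from earlier could take this shape (the 85°C threshold and label names are assumptions; check your GPU’s spec and your exporter’s labels):

```yaml
groups:
- name: gpu-alerts                  # illustrative rule group
  rules:
  - alert: GPUThermalThrottleRisk
    expr: DCGM_FI_DEV_GPU_TEMP > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} is running hot"
```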

Autoscaling Configuration:

  • KEDA or HPA configured with appropriate metrics (not just GPU utilization)
  • Cooldown periods tuned to avoid scaling thrash (300s is a good starting point)
  • Maximum replica limits set to prevent runaway costs
  • Karpenter or CA provisioners configured with appropriate instance types

GPU Optimization:

  • MIG or time-slicing enabled where appropriate
  • Taints applied to GPU nodes to prevent system pod interference
  • Resource requests and limits properly set on all GPU workloads

Cost Controls:

  • Spot instances configured for non-critical workloads
  • Scale-to-zero enabled during known idle periods
  • Budget alerts configured in cloud provider console

Disaster Recovery:

  • Multi-AZ GPU node pools (where supported)
  • Spot interruption handling tested
  • Model checkpoint strategies for long-running training jobs
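The checkpoint strategy boils down to: persist progress periodically, and on startup resume from the last checkpoint if one exists. A framework-agnostic sketch (the path and step counts are illustrative; in practice you would save optimizer and model state, not just a counter):

```python
import json
import os

CHECKPOINT_PATH = "/tmp/train_checkpoint.json"  # illustrative path

def load_checkpoint():
    """Return the last completed step, or 0 if no checkpoint exists."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)["step"]
    return 0

def save_checkpoint(step):
    """Write via a temp file so an interruption never leaves a corrupt checkpoint."""
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, CHECKPOINT_PATH)  # atomic rename on POSIX

def train(total_steps=100, checkpoint_every=10):
    """Run (or resume) training; returns the step we resumed from."""
    start = load_checkpoint()
    for step in range(start + 1, total_steps + 1):
        # ... one training step would run here ...
        if step % checkpoint_every == 0:
            save_checkpoint(step)
    return start
```

If a Spot interruption kills the pod after step 50, the rescheduled pod picks up from the last saved checkpoint instead of step 0, losing at most one checkpoint interval of work.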

The Self-Hosted Reality Check

Now that you understand what production GPU autoscaling requires, let’s be honest about what you’re signing up for.

The True Cost of Self-Hosted GPU Infrastructure

Time to Production: 3–6 weeks minimum for the following:

  • Karpenter/CA setup and testing
  • DCGM + Prometheus + Grafana configuration
  • KEDA or HPA tuning with custom metrics
  • MIG/time-slicing configuration
  • Cold start optimization strategies
  • Monitoring, alerting, and documentation

Once your infrastructure is set up and handling AI/ML workloads, you also need to factor in the ongoing maintenance burden:

  • Kubernetes version upgrades (quarterly)
  • NVIDIA driver updates when new GPUs are added
  • DCGM exporter version compatibility issues
  • Karpenter/CA bug fixes and version updates
  • Prometheus storage management and cost optimization
  • On-call rotation for infrastructure issues (your scaling configuration tends to drift and perform poorly if the demands for your system grow rapidly)

Hidden Costs:

  • DevOps engineer time: 40–60% of one FTE minimum (~$60–80K annually)
  • Infrastructure waste from imperfect autoscaling: 15–30%
  • Cold start revenue loss during traffic spikes: hard to quantify, but real
  • Opportunity cost: engineers maintaining infrastructure instead of improving models

Common Pain Points Teams Hit

  • “Our DCGM metrics stopped flowing to Prometheus after a K8s upgrade”

This happens. Often. The debugging process involves checking: exporter pod logs, Prometheus service discovery, RBAC permissions, network policies, custom metrics API registration, and HPA/KEDA configurations.

Time to resolution: 2–4 hours if you’re lucky, 2 days if you’re not.

  • “Karpenter provisioned 10 massive instances during a spike, now we have a $15K surprise bill”

Autoscaling without proper limits is dangerous. You need sophisticated cost guardrails across multiple layers: provisioner limits, KEDA max replicas, cloud provider budgets, and real-time alerting.

  • “Our inference latency is terrible but GPU utilization is only at 40%”

GPU metrics are misleading. Your model might be CPU-bound (pre/post-processing), I/O-bound (data loading), or have batch size misconfigurations. Debugging requires deep expertise across the entire stack.
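A quick way to see where latency actually goes is to time each stage of the request path separately before blaming the GPU. A minimal sketch (stage names and the sleep-based workloads are stand-ins for your real preprocessing, forward pass, and postprocessing):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Accumulate wall-clock time per pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

def handle_request():
    with timed("preprocess"):
        time.sleep(0.02)   # stand-in for tokenization / image decode (CPU)
    with timed("inference"):
        time.sleep(0.01)   # stand-in for the GPU forward pass
    with timed("postprocess"):
        time.sleep(0.005)  # stand-in for decoding / serialization (CPU)

handle_request()
total = sum(timings.values())
for stage, t in timings.items():
    print(f"{stage}: {t * 1000:.1f} ms ({t / total:.0%})")
```

If most of the wall-clock time lands outside the inference stage, adding GPUs will not help; fix the CPU or I/O bottleneck first.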

  • “Training jobs randomly fail because Spot instances get interrupted mid-run”

Spot instance handling requires: checkpoint/resume logic in your training code, node drain handlers, pod disruption budgets, and automatic job rescheduling configuration.
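For example, since Spot interruption handlers typically cordon and drain the node, a PodDisruptionBudget can pace the drain so a voluntary disruption never takes out every worker at once (the label selector is illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: training-pdb            # illustrative name
spec:
  maxUnavailable: 1             # evict at most one worker at a time
  selector:
    matchLabels:
      app: distributed-training
```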

Each of these requires specialized knowledge.

Why Teams Are Choosing Serverless GPU Platforms

After reviewing what self-hosted GPU autoscaling truly requires, many teams, especially in small and growing startups, are reaching a simple conclusion: this isn’t our core competency, and it shouldn’t be.

The Serverless Alternative: GPU Shards

GPU Shards is a serverless inference platform that handles the entire “Golden Stack” behind the scenes. Instead of spending weeks configuring Karpenter, KEDA, and DCGM, you deploy your model and get a production-ready endpoint in minutes.

What GPU Shards Handles Automatically

1. Intelligent Autoscaling Without Configuration

No YAML. No ScaledObjects. No provisioners.

We scale based on actual request queue depth and P99 latency — not GPU utilization. When your queue starts growing, we provision capacity in under 500ms using pre-warmed GPU pools.

What this means: Handle traffic spikes without cold start delays or complex configuration.

2. GPU Multi-Tenancy Optimization

Upload your model. We analyze its memory footprint and automatically:

  • Provision MIG slices for small models (maximizing cost efficiency)
  • Enable time-slicing for compatible workloads
  • Allocate dedicated GPUs for large models

You get 7x better GPU utilization without touching the NVIDIA GPU Operator.

3. Built-in Observability

Every endpoint includes:

  • Request latency breakdown (queue → execution → network)
  • GPU memory usage and saturation metrics
  • Real-time cost per request
  • Error rates and automatic anomaly detection

No Prometheus setup. No Grafana configuration. No DCGM exporter debugging.

4. True Scale-to-Zero

Your endpoint scales down to zero during idle periods. Zero GPU costs at 3am. Zero baseline capacity waste.

When the first request arrives, we serve it from pre-warmed pools in <500ms — not 5 minutes.

5. Production-Grade Deployment Strategies

Blue-green deployments built in. Push a new model version. We:

  1. Route 5% of traffic to the new version
  2. Monitor error rates automatically
  3. Roll back instantly if issues are detected
  4. Gradually shift 100% of traffic if stable

No Kubernetes deployments to manage. No manual rollback procedures.

Real Cost Comparison: Medium Traffic Workload

Self-Hosted Monthly Costs:

  • 4x A100 instances (24/7 baseline for cold start mitigation): $12,000
  • Kubernetes control plane and node overhead: $500
  • Monitoring infrastructure (Prometheus + Grafana): $300
  • Data transfer and storage: $200
  • DevOps engineer allocation (50% FTE): $10,000
  • Total: $23,000/month

GPU Shards Monthly Costs:

  • Pay-per-request inference: $6,000
  • Monitoring and logging: Included
  • Autoscaling and optimization: Included
  • DevOps overhead: $0
  • Cost during idle periods: $0
  • Total: $6,000/month

Savings: $17,000/month (74% reduction)

When Self-Hosted Still Makes Sense

GPU Shards isn’t right for every team. You should consider self-hosted if:

  • You have dedicated ML infrastructure engineers who specialize in GPU optimization
  • You need absolute control over every aspect of the stack for compliance
  • You’re running massive scale (>100 GPUs continuously) where managed overhead exceeds DIY savings
  • You have extremely specialized hardware requirements not yet supported by serverless platforms
  • Your models require custom CUDA kernels with non-standard configurations

For most teams — especially those under 50 GPUs or without dedicated infrastructure specialists — serverless is the pragmatic choice.

Conclusion: Focus on Models, Not Infrastructure

GPU autoscaling in production is a solved problem — but solving it yourself requires weeks of engineering time and ongoing maintenance burden.

The choice is clear:

  • Self-hosted: Own the complexity, control every detail, employ dedicated infrastructure specialists.
  • GPU Shards: Deploy in minutes, scale automatically, let your ML team focus on improving models instead of debugging Kubernetes.

For most teams, the answer is obvious.

