Autoscaling K8s GPU Workloads in Production: A Complete Guide

Scaling GPU workloads isn’t like scaling traditional applications. GPUs are expensive, scarce, and suffer from cold start problems that can kill your user experience. Whether you’re running real-time inference for a customer-facing API or batch training jobs for your research team, getting autoscaling right means the difference between a $5,000 monthly bill and a $50,000 surprise.

This guide covers everything you need to know about GPU autoscaling — from the foundational principles to the “Golden Stack” architecture pattern that leading ML teams use in production. Then we’ll explore why an increasing number of teams are moving away from self-hosted solutions entirely.

Understanding GPU Autoscaling

Why GPU Autoscaling Is Different

Traditional CPU autoscaling is relatively straightforward: monitor CPU usage, add instances when it’s high, remove them when it’s low. With GPUs, this approach fails for several reasons:

1. GPU Utilization Is a Terrible Metric

A GPU showing 100% compute utilization might have an empty request queue, or it might be desperately overloaded. Conversely, 0% utilization could mean idle capacity or simply that the model is loading weights into VRAM.

Better metrics for inference:

  • Request queue depth
  • P95/P99 latency
  • Requests per second vs. capacity

Better metrics for training:

  • Job queue length
  • Dataset processing throughput
  • Training step time

2. Cold Starts Are Measured in Minutes, Not Seconds

When you need more GPU capacity:

  • Cloud providers take 3–5 minutes to provision new GPU instances
  • Container images for ML workloads are often 5–15GB
  • Model weights need to load into VRAM (another 30–120 seconds)

Total time to serve the first request: 5–8 minutes minimum

By comparison, CPU-based services typically achieve cold starts in 1–10 seconds.

3. Cost Asymmetry Is Extreme

An AWS t3.medium CPU instance costs ~$30/month. An AWS p4d.24xlarge with 8x A100 GPUs costs ~$32,000/month. A single idle GPU can waste more money in a day than a month of CPU overcapacity.

This makes both under-provisioning (poor user experience) and over-provisioning (budget destruction) unacceptable. You need precision.

The Golden Stack Architecture

The industry consensus for production GPU autoscaling on Kubernetes revolves around three core components:

Layer 1: Metrics and Observability

You cannot autoscale what you cannot measure.

NVIDIA DCGM Exporter (Mandatory): DCGM exports GPU-specific metrics that Kubernetes’ default metrics server knows nothing about:

```
DCGM_FI_DEV_GPU_UTIL     - GPU compute utilization
DCGM_FI_DEV_FB_USED      - Framebuffer (VRAM) memory used
DCGM_FI_DEV_GPU_TEMP     - GPU temperature
DCGM_FI_DEV_POWER_USAGE  - Power consumption
```

These metrics feed into Prometheus, which then exposes them to your autoscaling controllers via the Custom Metrics API.
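As a sketch, a Prometheus scrape job for the exporter could look like the following (the job name and service-discovery labels are assumptions; match whatever names your own DCGM exporter deployment uses):

```yaml
scrape_configs:
- job_name: dcgm-exporter             # illustrative job name
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # Keep only the DCGM exporter's endpoints; adjust the regex to the
  # actual service name in your cluster.
  - source_labels: [__meta_kubernetes_service_name]
    regex: dcgm-exporter
    action: keep
```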

The Prometheus Stack

You need:

  • Prometheus server to scrape DCGM metrics
  • Prometheus Adapter to expose metrics to Kubernetes Custom Metrics API
  • Grafana for visualization (optional but highly recommended)

Reality check: This is 3–4 separate deployments with interconnected configurations. Budget 2–3 days for setup and debugging.

Pod Autoscaling (Horizontal)

Different workload types require completely different strategies.

For Inference Workloads: KEDA (Event-Driven Autoscaling)

The standard Horizontal Pod Autoscaler (HPA) polls metrics every 15–30 seconds and uses a gradual scaling algorithm. This is too slow for spiky inference traffic.

KEDA (Kubernetes Event-Driven Autoscaling) reacts to events in real-time:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    name: model-deployment
  minReplicaCount: 0    # Scale to zero during idle
  maxReplicaCount: 20
  cooldownPeriod: 300   # Wait 5 min before scaling down
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      metricName: inference_queue_depth
      query: |
        sum(rate(inference_requests_pending[1m]))
      threshold: "10"
```

Key KEDA advantages:

  • Scale to zero when no requests are pending (massive cost savings)
  • React to queue depth, not just resource utilization
  • Support for multiple trigger types (Kafka lag, RabbitMQ depth, HTTP requests)

Alternative: HPA with Custom Metrics

If you must use HPA, configure it with Prometheus Adapter:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: "65"  # Target 65% utilization
```

Warning: Setting the target too high (90%+) means new pods won’t be ready before users experience degradation. Setting it too low wastes money. 60–70% is the sweet spot for most workloads.

For Training Workloads: Job Queueing

Training jobs are typically fixed-size (e.g., “I need 8 GPUs for distributed training”). Don’t use HPA for these.

Better approaches:

  • Kueue — Kubernetes-native job queueing with fair-sharing and priority
  • Volcano — Batch scheduling system with gang scheduling (all-or-nothing)
  • Ray on Kubernetes — For complex multi-stage ML pipelines

These systems queue jobs until sufficient GPU resources are available, preventing hundreds of Pending pods from overwhelming your scheduler.
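As a sketch of the Kueue approach, a ClusterQueue caps total GPU consumption across admitted jobs; anything beyond the quota waits in the queue rather than sitting Pending (the names and quota below are illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-queue                # illustrative name
spec:
  namespaceSelector: {}          # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 16         # jobs needing more than the free quota wait
```

Jobs opt in by pointing at a LocalQueue in their namespace; gang admission means a distributed job gets all of its GPUs or none of them.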

Node Autoscaling (Infrastructure)

This is where you win or lose on cost and performance.

Option A: Karpenter (Recommended for AWS/Azure)

The standard Cluster Autoscaler requires pre-defined “node groups” (e.g., a group for T4 GPUs, another for A100s). This forces you to:

  • Predict which GPU types you’ll need
  • Create and manage separate groups for each
  • Deal with bin-packing inefficiencies

Karpenter is “groupless”. When a pod requests a GPU, Karpenter:

  1. Reads the pod’s resource requirements
  2. Calls the cloud API to find the cheapest/fastest instance type that fits
  3. Provisions exactly that instance type on-demand

Example provisioner:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-provisioner
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["g4dn.xlarge", "p3.2xlarge", "p4d.24xlarge"]
  limits:
    resources:
      nvidia.com/gpu: 100
  # Note: in v1alpha5, consolidation and ttlSecondsAfterEmpty are mutually
  # exclusive. Use ttlSecondsAfterEmpty: 300 instead if you disable consolidation.
  consolidation:
    enabled: true
```

Key benefits:

  • Automatic bin-packing to maximize GPU utilization
  • Support for Spot instances with automatic fallback
  • Consolidation: moves pods to fewer nodes to reduce waste

Cluster Autoscaler (For GKE or Other Platforms)

If Karpenter isn’t available, the Cluster Autoscaler works against pre-defined node groups. On self-managed clusters the group bounds are passed as flags to the autoscaler (on GKE, you instead enable autoscaling on the node pool itself):

```yaml
# Excerpt from a self-managed Cluster Autoscaler Deployment
containers:
- name: cluster-autoscaler
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --nodes=0:10:gpu-t4-spot        # min:max:node-group-name
  - --nodes=0:5:gpu-a100-ondemand
```

Critical: use taints and tolerations. Apply a NoSchedule taint to all GPU nodes:

```shell
kubectl taint nodes -l gpu=true nvidia.com/gpu=present:NoSchedule
```

This prevents system pods (CoreDNS, monitoring agents) from stealing slots on expensive GPU nodes.

Your GPU workload pods need matching tolerations:

```yaml
tolerations:
- key: nvidia.com/gpu
  operator: Equal
  value: present
  effect: NoSchedule
```

Advanced GPU Optimization Techniques

Multi-Instance GPU (MIG)

Modern GPUs like A100 and H100 can be partitioned into up to 7 isolated instances.

Best for: Small inference models that don’t need a full GPU

Example partitioning:

  • 1x A100 (40GB) → 7x MIG slices (1g.5gb each)
  • Each slice acts as an independent GPU from Kubernetes’ perspective

Configuration:

```shell
# Enable MIG mode on GPU 0
sudo nvidia-smi -i 0 -mig 1

# Create seven 1g.5gb GPU instances (profile ID 19 on an A100 40GB)
# plus their compute instances (-C)
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C

# Install NVIDIA GPU Operator with MIG support
helm install gpu-operator nvidia/gpu-operator \
  --set mig.strategy=single
```

Kubernetes now sees these as distinct resources:

```yaml
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1
```

Cost impact: Run 7 workloads on hardware that previously served 1 = 7x better utilization.

GPU Time-Slicing (For Older GPUs)

T4 and V100 GPUs don’t support MIG, but you can share them through the NVIDIA device plugin’s time-slicing feature (NVIDIA MPS is a separate, complementary sharing mechanism).

Configuration in GPU Operator:

```yaml
devicePlugin:
  config:
    name: time-slicing-config
    create: true
    default: any
    data:
      any: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
            - name: nvidia.com/gpu
              replicas: 4   # Allow 4 pods per GPU
```

Trade-off: Pods share GPU compute time. If all 4 pods are active simultaneously, each gets ~25% of GPU cycles. Works well for bursty inference workloads where pods are rarely all active at once.

Solving the Cold Start Problem

The 5–8 minute cold start for new GPU nodes is often the killer for user experience.

Technique 1: Overprovisioning with Balloon Pods

Run low-priority “pause” pods that reserve GPU capacity but do nothing (or use the pod to pre-pull your pre-baked image with model weights):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-overprovisioner
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-overprovisioner
  template:
    metadata:
      labels:
        app: gpu-overprovisioner
    spec:
      priorityClassName: overprovisioning  # Low priority
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.5
        resources:
          limits:
            nvidia.com/gpu: 1
```

When a real inference pod needs capacity, Kubernetes evicts the pause pod instantly (milliseconds), and your real workload schedules immediately.

Cost: You’re paying for unused GPU capacity, but the user experience improvement often justifies it for critical workloads.
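The `overprovisioning` priority class referenced by the balloon deployment has to exist; a minimal definition looks like this (the value just needs to be lower than that of any real workload):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Placeholder pods that reserve GPU capacity and yield to real workloads"
```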

Technique 2: Model Weight Pre-loading

Bad approach: Download model weights from S3/GCS on every pod startup

```python
# DON'T DO THIS
import boto3
import torch

s3 = boto3.client("s3")
s3.download_file("bucket", "model.pt", "/tmp/model.pt")

model = torch.load("/tmp/model.pt")
```

Good approach: Bake weights into container image

```dockerfile
COPY model.pt /app/model.pt
```

Best approach: Use ReadOnlyMany (ROX) volume with a caching layer:

```yaml
volumes:
- name: model-weights
  persistentVolumeClaim:
    claimName: model-weights-pvc
    readOnly: true
```

Weights are cached on each node. New pods access them immediately.

Production Checklist

Before going live with GPU autoscaling:

Metrics & Monitoring:

  • DCGM Exporter installed and scraping all GPU nodes
  • Prometheus configured with appropriate retention (7–30 days)
  • Grafana dashboards showing GPU utilization, memory, queue depth
  • Alerts configured for: GPU OOM errors, thermal throttling, pod scheduling failures
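As an illustration, a thermal alert built on the DCGM metrics from earlier could take this shape (the 85°C threshold and label names are assumptions; check your GPU’s spec and your exporter’s labels):

```yaml
groups:
- name: gpu-alerts                  # illustrative rule group
  rules:
  - alert: GPUThermalThrottleRisk
    expr: DCGM_FI_DEV_GPU_TEMP > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} is running hot"
```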

Autoscaling Configuration:

  • KEDA or HPA configured with appropriate metrics (not just GPU utilization)
  • Cooldown periods tuned to avoid scaling thrash (300s is a good starting point)
  • Maximum replica limits set to prevent runaway costs
  • Karpenter or CA provisioners configured with appropriate instance types

GPU Optimization:

  • MIG or time-slicing enabled where appropriate
  • Taints applied to GPU nodes to prevent system pod interference
  • Resource requests and limits properly set on all GPU workloads

Cost Controls:

  • Spot instances configured for non-critical workloads
  • Scale-to-zero enabled during known idle periods
  • Budget alerts configured in cloud provider console

Disaster Recovery:

  • Multi-AZ GPU node pools (where supported)
  • Spot interruption handling tested
  • Model checkpoint strategies for long-running training jobs
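The checkpoint strategy boils down to: persist progress periodically, and on startup resume from the last checkpoint if one exists. A framework-agnostic sketch (the path and step counts are illustrative; in practice you would save optimizer and model state, not just a counter):

```python
import json
import os

CHECKPOINT_PATH = "/tmp/train_checkpoint.json"  # illustrative path

def load_checkpoint():
    """Return the last completed step, or 0 if no checkpoint exists."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)["step"]
    return 0

def save_checkpoint(step):
    """Write via a temp file so an interruption never leaves a corrupt checkpoint."""
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, CHECKPOINT_PATH)  # atomic rename on POSIX

def train(total_steps=100, checkpoint_every=10):
    """Run (or resume) training; returns the step we resumed from."""
    start = load_checkpoint()
    for step in range(start + 1, total_steps + 1):
        # ... one training step would run here ...
        if step % checkpoint_every == 0:
            save_checkpoint(step)
    return start
```

If a Spot interruption kills the pod after step 50, the rescheduled pod picks up from the last saved checkpoint instead of step 0, losing at most one checkpoint interval of work.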

The Self-Hosted Reality Check

Now that you understand what production GPU autoscaling requires, let’s be honest about what you’re signing up for.

The True Cost of Self-Hosted GPU Infrastructure

Time to Production: 3–6 weeks minimum for the following:

  • Karpenter/CA setup and testing
  • DCGM + Prometheus + Grafana configuration
  • KEDA or HPA tuning with custom metrics
  • MIG/time-slicing configuration
  • Cold start optimization strategies
  • Monitoring, alerting, and documentation

Once your infrastructure is set up and handling AI/ML workloads, you also need to factor in the ongoing maintenance burden:

  • Kubernetes version upgrades (quarterly)
  • NVIDIA driver updates when new GPUs are added
  • DCGM exporter version compatibility issues
  • Karpenter/CA bug fixes and version updates
  • Prometheus storage management and cost optimization
  • On-call rotation for infrastructure issues (your scaling configuration tends to drift and perform poorly if the demands for your system grow rapidly)

Hidden Costs:

  • DevOps engineer time: 40–60% of one FTE minimum (~$60–80K annually)
  • Infrastructure waste from imperfect autoscaling: 15–30%
  • Cold start revenue loss during traffic spikes: hard to quantify, but real
  • Opportunity cost: engineers maintaining infrastructure instead of improving models

Common Pain Points Teams Hit

  • “Our DCGM metrics stopped flowing to Prometheus after a K8s upgrade”

This happens. Often. The debugging process involves checking: exporter pod logs, Prometheus service discovery, RBAC permissions, network policies, custom metrics API registration, and HPA/KEDA configurations.

Time to resolution: 2–4 hours if you’re lucky, 2 days if you’re not.

  • “Karpenter provisioned 10 massive instances during a spike, now we have a $15K surprise bill”

Autoscaling without proper limits is dangerous. You need sophisticated cost guardrails across multiple layers: provisioner limits, KEDA max replicas, cloud provider budgets, and real-time alerting.

  • “Our inference latency is terrible but GPU utilization is only at 40%”

GPU metrics are misleading. Your model might be CPU-bound (pre/post-processing), I/O-bound (data loading), or have batch size misconfigurations. Debugging requires deep expertise across the entire stack.
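A quick way to see where latency actually goes is to time each stage of the request path separately before blaming the GPU. A minimal sketch (stage names and the sleep-based workloads are stand-ins for your real preprocessing, forward pass, and postprocessing):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Accumulate wall-clock time per pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

def handle_request():
    with timed("preprocess"):
        time.sleep(0.02)   # stand-in for tokenization / image decode (CPU)
    with timed("inference"):
        time.sleep(0.01)   # stand-in for the GPU forward pass
    with timed("postprocess"):
        time.sleep(0.005)  # stand-in for decoding / serialization (CPU)

handle_request()
total = sum(timings.values())
for stage, t in timings.items():
    print(f"{stage}: {t * 1000:.1f} ms ({t / total:.0%})")
```

If most of the wall-clock time lands outside the inference stage, adding GPUs will not help; fix the CPU or I/O bottleneck first.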

  • “Training jobs randomly fail because Spot instances get interrupted mid-run”

Spot instance handling requires: checkpoint/resume logic in your training code, node drain handlers, pod disruption budgets, and automatic job rescheduling configuration.
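For example, since Spot interruption handlers typically cordon and drain the node, a PodDisruptionBudget can pace the drain so a voluntary disruption never takes out every worker at once (the label selector is illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: training-pdb            # illustrative name
spec:
  maxUnavailable: 1             # evict at most one worker at a time
  selector:
    matchLabels:
      app: distributed-training
```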

Each of these requires specialized knowledge.

Why Teams Are Choosing Serverless GPU Platforms

After reviewing what self-hosted GPU autoscaling truly requires, many teams, especially in small and growing startups, are reaching a simple conclusion: this isn’t our core competency, and it shouldn’t be.

The Serverless Alternative: GPU Shards

GPU Shards is a serverless inference platform that handles the entire “Golden Stack” behind the scenes. Instead of spending weeks configuring Karpenter, KEDA, and DCGM, you deploy your model and get a production-ready endpoint in minutes.

What GPU Shards Handles Automatically

1. Intelligent Autoscaling Without Configuration

No YAML. No ScaledObjects. No provisioners.

We scale based on actual request queue depth and P99 latency — not GPU utilization. When your queue starts growing, we provision capacity in under 500ms using pre-warmed GPU pools.

What this means: Handle traffic spikes without cold start delays or complex configuration.

2. GPU Multi-Tenancy Optimization

Upload your model. We analyze its memory footprint and automatically:

  • Provision MIG slices for small models (maximizing cost efficiency)
  • Enable time-slicing for compatible workloads
  • Allocate dedicated GPUs for large models

You get 7x better GPU utilization without touching the NVIDIA GPU Operator.

3. Built-in Observability

Every endpoint includes:

  • Request latency breakdown (queue → execution → network)
  • GPU memory usage and saturation metrics
  • Real-time cost per request
  • Error rates and automatic anomaly detection

No Prometheus setup. No Grafana configuration. No DCGM exporter debugging.

4. True Scale-to-Zero

Your endpoint scales down to zero during idle periods. Zero GPU costs at 3am. Zero baseline capacity waste.

When the first request arrives, we serve it from pre-warmed pools in <500ms — not 5 minutes.

5. Production-Grade Deployment Strategies

Blue-green deployments built in. Push a new model version. We:

  1. Route 5% of traffic to the new version
  2. Monitor error rates automatically
  3. Roll back instantly if issues are detected
  4. Gradually shift 100% of traffic if stable

No Kubernetes deployments to manage. No manual rollback procedures.

Real Cost Comparison: Medium Traffic Workload

Self-Hosted Monthly Costs:

  • 4x A100 instances (24/7 baseline for cold start mitigation): $12,000
  • Kubernetes control plane and node overhead: $500
  • Monitoring infrastructure (Prometheus + Grafana): $300
  • Data transfer and storage: $200
  • DevOps engineer allocation (50% FTE): $10,000
  • Total: $23,000/month

GPU Shards Monthly Costs:

  • Pay-per-request inference: $6,000
  • Monitoring and logging: Included
  • Autoscaling and optimization: Included
  • DevOps overhead: $0
  • Cost during idle periods: $0
  • Total: $6,000/month

Savings: $17,000/month (74% reduction)

When Self-Hosted Still Makes Sense

GPU Shards isn’t right for every team. You should consider self-hosted if:

  • You have dedicated ML infrastructure engineers who specialize in GPU optimization
  • You need absolute control over every aspect of the stack for compliance
  • You’re running massive scale (>100 GPUs continuously) where managed overhead exceeds DIY savings
  • You have extremely specialized hardware requirements not yet supported by serverless platforms
  • Your models require custom CUDA kernels with non-standard configurations

For most teams — especially those under 50 GPUs or without dedicated infrastructure specialists — serverless is the pragmatic choice.

Conclusion: Focus on Models, Not Infrastructure

GPU autoscaling in production is a solved problem — but solving it yourself requires weeks of engineering time and ongoing maintenance burden.

The choice is clear:

  • Self-hosted: Own the complexity, control every detail, employ dedicated infrastructure specialists.
  • GPU Shards: Deploy in minutes, scale automatically, let your ML team focus on improving models instead of debugging Kubernetes.

For most teams, the answer is obvious.

