Tracking GPU Usage in K8s with Prometheus and DCGM: A Complete Guide
If you’re running GPU workloads in the cloud, you know that visibility into GPU utilization is absolutely critical. GPUs are expensive resources, and understanding how they’re being used can mean the difference between an efficient, cost-effective deployment and one that’s burning through your budget unnecessarily. In this guide, we’ll walk through how to set up informative GPU monitoring in K8s using Prometheus and NVIDIA’s DCGM (Data Center GPU Manager).
Why Monitor GPU Usage?
Before we dive into the technical details, let’s understand why GPU monitoring matters so much. When you’re running machine learning training jobs, inference services, or any GPU-accelerated workload, you need to answer several important questions. Are your GPUs actually being utilized, or are they sitting idle while you’re still paying for them? Is there a bottleneck elsewhere in your pipeline that’s preventing your GPUs from running at full capacity? Are you allocating the right GPU types for your workloads?
Without proper monitoring, you’re essentially flying blind. You might have a training job that’s taking days to complete, not because the model is complex, but because your GPU is only being utilized at twenty percent capacity due to slow data loading. Or you might be paying for high-end A100 GPUs when your workload could run just as well on more affordable T4s.
Understanding the Monitoring Stack
Let’s break down the components we’ll be using. Prometheus is an open-source monitoring system that collects metrics from configured targets at regular intervals. It stores these metrics as time-series data, which means you can track how values change over time and query historical data. Prometheus uses a pull model, where it actively scrapes metrics from endpoints you configure.
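To make the pull model concrete, here is a minimal standalone Prometheus configuration sketch. The target address is a placeholder for illustration; in the cluster setup below we’ll let the Prometheus Operator discover the exporter via a ServiceMonitor instead of a static config.

```yaml
# prometheus.yml — minimal sketch of the pull model (target address is a placeholder)
global:
  scrape_interval: 15s        # how often Prometheus pulls metrics from each target
scrape_configs:
  - job_name: "nvidia-dcgm-exporter"
    static_configs:
      - targets: ["dcgm-exporter.example.internal:9400"]  # hypothetical DCGM exporter endpoint
```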
For GPU metrics specifically, we’ll use the DCGM Exporter, a tool that exposes GPU metrics in a format Prometheus can scrape. It queries NVIDIA GPUs through the DCGM library and presents the data on an HTTP endpoint that Prometheus can read.
Setting Up Your Cluster
Let’s start by setting up a GPU-enabled K8s cluster where we’ll run both the workload and the monitoring stack. For this you’ll need a machine with an NVIDIA GPU (either one you run at home or one you rent in the cloud) and minikube installed.
For the example we are going to use a PC workstation with an RTX 3060. Follow our tutorial on how to set up minikube and enable GPUs with the GPU Operator, or watch the video instead:
When you install the NVIDIA GPU Operator in your K8s cluster, it automatically sets up the DCGM exporter, which comes almost ready to handle metric queries from Prometheus out of the box.
But before you can visualize the metrics you’ll need to add two more components to your cluster’s monitoring stack: Prometheus and Grafana. Both ship together in a Helm chart, so installing them is as simple as running:
helm install kube-prometheus-stack oci://ghcr.io/prometheus-community/charts/kube-prometheus-stack --namespace monitoring --create-namespace
After running the command, all of the stack’s resources will live in the monitoring namespace. You can check what was deployed by running:
kubectl get all -n monitoring

Configuring DCGM Exporter
Once Prometheus and DCGM are installed, you need to configure the exporter to expose the right metrics. The DCGM Exporter runs as a service that continuously queries your GPUs and presents the data on an HTTP endpoint, typically on port 9400.
The metrics exposed include GPU utilization percentage, memory usage both in absolute terms and as a percentage, GPU temperature, power consumption, and even more granular metrics like SM (Streaming Multiprocessor) occupancy and memory bandwidth utilization. These detailed metrics help you understand not just whether your GPU is busy, but how efficiently it’s being used.
When you run the DCGM Exporter, you can configure which metrics to collect and how frequently to update them. For most use cases, the default configuration provides a good balance between metric detail and system overhead.
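For reference, the DCGM Exporter reads its metric selection from a CSV file listing the DCGM fields to collect. The snippet below is a sketch of that format; the exact way it is wired in (a flag on the exporter or a ConfigMap managed by the GPU Operator) can vary by version, so treat it as illustrative.

```
# Format: DCGM field, Prometheus metric type, help string
DCGM_FI_DEV_GPU_UTIL,    gauge, GPU utilization (in %).
DCGM_FI_DEV_FB_USED,     gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
```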
By default the GPU metrics are not sent to Prometheus and you’ll need to create a Service Monitor to make them available for scraping. A service monitor is a custom resource from the Prometheus Operator that tells Prometheus how to discover and scrape metrics from your GPU Operator.
Considering that you’ve followed the video tutorial and installed the gpu-operator helm chart in the gpu-operator namespace, you can expose the metrics to Prometheus by running:
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
  labels:
    app: nvidia-dcgm-exporter
    release: kube-prometheus-stack
spec:
  endpoints:
    - path: /metrics
      port: gpu-metrics
  jobLabel: app
  namespaceSelector:
    matchNames:
      - gpu-operator
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
EOF
Understanding the Metrics
Once Prometheus is collecting GPU metrics, you’ll have access to a wealth of information. Let’s check that our ServiceMonitor is working (you might have to wait a couple of minutes after creating the resource for some metrics to be collected). Open a new terminal and run:
kubectl port-forward -n monitoring prometheus-kube-prometheus-stack-prometheus-0 39090:9090
The command will create a tunnel to your cluster and expose Prometheus on http://localhost:39090
Let’s check if the GPU operator metrics are exposed to Prometheus:
curl -sG http://localhost:39090/api/v1/query \
  --data-urlencode 'query=up{job="nvidia-dcgm-exporter"}' \
  | jq .
If it was successful you’ll see output similar to this:

If you navigate to http://localhost:39090 from your browser you’ll see the WebUI where you can run metrics queries:

Let’s understand what some of the key metrics tell you.
Memory utilization metrics come in two forms. You have the absolute memory used in bytes, and you have the percentage of total GPU memory being used. These metrics help you understand if you’re memory-bound. If you see high memory usage but low compute utilization, it might indicate that your workload is spending too much time moving data around rather than computing.
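As a quick illustration of how the two forms relate, here is a small Python sketch that derives the percentage from the absolute framebuffer metrics (which DCGM reports in MiB). The sample values are made up for a 12 GiB card; the field names match the DCGM exporter list later in this guide.

```python
# Sketch: derive a memory-utilization percentage from DCGM framebuffer metrics.
# All inputs are in MiB; the example readings below are hypothetical.

def fb_memory_percent(fb_used_mib: float, fb_free_mib: float,
                      fb_reserved_mib: float = 0.0) -> float:
    """Used framebuffer memory as a percentage of the card's total."""
    total = fb_used_mib + fb_free_mib + fb_reserved_mib
    if total == 0:
        return 0.0
    return 100.0 * fb_used_mib / total

# Hypothetical readings from a 12 GiB (12288 MiB) card:
print(round(fb_memory_percent(fb_used_mib=9216, fb_free_mib=2816,
                              fb_reserved_mib=256), 1))  # → 75.0
```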
Temperature and power metrics help you understand the physical state of your GPU. High temperatures might indicate cooling issues, while power consumption tells you about energy efficiency. If your GPU is drawing maximum power but showing low utilization, something is misconfigured.
Here is a full description of the metrics available for the RTX 3060:
DCGM_FI_DEV_SM_CLOCK - Streaming multiprocessor clock frequency in MHz, which represents how fast the GPU's compute cores are running
DCGM_FI_DEV_MEM_CLOCK - Memory clock frequency in MHz, indicating the speed at which the GPU's memory operates
DCGM_FI_DEV_MEMORY_TEMP - Memory temperature in Celsius, measuring how hot the GPU's memory chips are running
DCGM_FI_DEV_GPU_TEMP - GPU core temperature in Celsius, showing the temperature of the main GPU die
DCGM_FI_DEV_POWER_USAGE - Instantaneous power draw in watts, telling you how much electricity the GPU is consuming right now
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION - Cumulative energy consumption in millijoules since boot, tracking total power usage over time
DCGM_FI_DEV_PCIE_REPLAY_COUNTER - Total number of PCIe transaction retries, which indicates communication errors between the GPU and system
DCGM_FI_DEV_GPU_UTIL - GPU compute utilization as a percentage, showing how busy the processing cores are
DCGM_FI_DEV_MEM_COPY_UTIL - Memory bandwidth utilization as a percentage, indicating how intensively data is being moved to and from GPU memory
DCGM_FI_DEV_ENC_UTIL - Hardware encoder utilization as a percentage, measuring usage of the dedicated video encoding engine
DCGM_FI_DEV_DEC_UTIL - Hardware decoder utilization as a percentage, measuring usage of the dedicated video decoding engine
DCGM_FI_DEV_XID_ERRORS - Last XID error code encountered, where XID errors are NVIDIA's standardized GPU hardware error codes
DCGM_FI_DEV_FB_FREE - Free framebuffer memory in MiB, showing how much GPU memory is available for allocation
DCGM_FI_DEV_FB_USED - Used framebuffer memory in MiB, indicating how much GPU memory is actively allocated by applications
DCGM_FI_DEV_FB_RESERVED - Reserved framebuffer memory in MiB, showing memory set aside by the system and drivers
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL - Total NVLink bandwidth counter across all lanes, measuring high-speed GPU-to-GPU communication throughput
DCGM_FI_DEV_VGPU_LICENSE_STATUS - Virtual GPU license status, indicating whether vGPU virtualization licensing is active
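If you ever want to work with these metrics outside of Prometheus, the exporter's /metrics endpoint serves them in the plain-text Prometheus exposition format, which is easy to parse. Below is a small Python sketch; the sample payload is hand-written for illustration (real output also carries labels like gpu="0" and the device model name).

```python
# Sketch: parse DCGM gauge values out of Prometheus exposition-format text.
# The sample string is illustrative, not captured from a real exporter.
import re

def parse_dcgm_metrics(text: str) -> dict:
    """Map metric name -> float value for DCGM_* lines (labels ignored)."""
    metrics = {}
    for line in text.splitlines():
        m = re.match(r'(DCGM_\w+)(?:\{[^}]*\})?\s+([0-9.eE+-]+)', line)
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics

sample = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 83
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 61
"""
print(parse_dcgm_metrics(sample))
# → {'DCGM_FI_DEV_GPU_UTIL': 83.0, 'DCGM_FI_DEV_GPU_TEMP': 61.0}
```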
Visualizing with Grafana
While Prometheus collects and stores the metrics, Grafana is typically used for visualization. Grafana connects to Prometheus as a data source and lets you create dashboards that display your GPU metrics in intuitive graphs and charts.
Let’s port-forward the Grafana WebUI to our localhost and check it out:
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3080:80
You’ll be prompted with a login page where you can use the default credentials if you haven’t configured custom ones during the Helm chart release:
Username: admin
Password: prom-operator

You could create a dashboard that shows GPU utilization over time as a line graph, allowing you to see patterns and trends. You might add a gauge showing current memory usage, with color coding to indicate when you’re approaching limits. Temperature graphs help you spot cooling issues, and power consumption charts help with capacity planning.
The real power comes from combining multiple metrics. You might create a panel that shows GPU utilization alongside your application-specific metrics like training loss or inference throughput. This correlation helps you understand the relationship between GPU usage and application performance.
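As a starting point for such panels, PromQL queries along these lines work with the DCGM metric names listed earlier (label names like gpu may differ slightly depending on your exporter version):

```
# Average compute utilization per GPU over the last 5 minutes
avg by (gpu) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))

# Framebuffer memory used as a percentage of the total
100 * DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)

# Hottest GPU temperature across the cluster
max(DCGM_FI_DEV_GPU_TEMP)
```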
Setting Up Alerts
Monitoring is only part of the solution. You also want to be alerted when things go wrong. Prometheus has a component called Alertmanager that can send notifications based on metric conditions you define.
You might create an alert that fires when GPU utilization drops below ten percent for more than five minutes, indicating that your expensive GPU is sitting idle. Or you could alert when GPU temperature exceeds 85 degrees Celsius, suggesting a cooling problem. Memory alerts can warn you when you’re approaching out-of-memory conditions that would crash your workload.
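With the Prometheus Operator already installed, both of those alerts can be expressed as a PrometheusRule resource. The sketch below assumes your release is named kube-prometheus-stack (the release label must match for the rule to be picked up); the thresholds mirror the examples above and are illustrative, not tuned recommendations.

```yaml
# Sketch: idle-GPU and overheating alerts as a PrometheusRule
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match your Helm release
spec:
  groups:
    - name: gpu
      rules:
        - alert: GPUIdle
          expr: DCGM_FI_DEV_GPU_UTIL < 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "GPU utilization below 10% for 5 minutes"
        - alert: GPUOverheating
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "GPU temperature above 85 degrees Celsius"
```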
These alerts can be sent via email, Slack, PagerDuty, or numerous other channels, ensuring you’re notified about issues even when you’re not actively watching dashboards.
Optimizing Based on Metrics
The final and most important step is actually using these metrics to optimize your GPU usage. When you look at your dashboards, you’re looking for patterns and problems.
If you see consistently low GPU utilization, you need to investigate why. Common causes include slow data loading (your GPU is waiting for data), inefficient batch sizes (you’re not keeping the GPU fully occupied), or CPU bottlenecks in your preprocessing pipeline. Each of these problems has different solutions, and your metrics help you diagnose which one you’re facing.
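The diagnosis logic can be sketched as a simple heuristic over two of the DCGM readings: compute utilization and memory-copy utilization. The thresholds below are illustrative placeholders, not tuned recommendations.

```python
# Sketch: a rough heuristic turning two DCGM readings into a diagnosis hint.
# Thresholds (70% busy, 50% copy-heavy) are illustrative only.

def diagnose(gpu_util_pct: float, mem_copy_util_pct: float) -> str:
    """Classify a utilization snapshot from compute vs. memory-copy load."""
    if gpu_util_pct >= 70:
        return "healthy: GPU is well utilized"
    if mem_copy_util_pct >= 50:
        return "possible memory-bandwidth bound: lots of copies, little compute"
    return "possible input bottleneck: GPU mostly waiting (check data loading)"

print(diagnose(gpu_util_pct=18, mem_copy_util_pct=5))
# → possible input bottleneck: GPU mostly waiting (check data loading)
```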
If memory usage is constantly at maximum, you might need to reduce your batch size or optimize your model architecture. If you see high memory usage with low compute utilization, you might be memory-bandwidth bound, suggesting you should restructure your operations to be more compute-intensive.
Temperature and power metrics help with capacity planning and cost optimization. If your GPUs are consistently running hot and using maximum power but only showing moderate utilization, you might be able to use lower-tier GPUs and save costs.
Best Practices and Tips
Through practical experience, several best practices emerge for GPU monitoring. First, always start monitoring before you start your workload. You want baseline metrics to compare against, and you want to catch configuration issues early.
Set your scrape intervals appropriately. For interactive workloads and debugging, you might want very frequent scrapes every few seconds. For long-running training jobs, every 15–30 seconds is usually sufficient and reduces the overhead of monitoring itself.
Label everything consistently. When you have multiple GPU instances, proper labeling becomes essential for making sense of your metrics. Include labels for environment (dev, staging, production), workload type, team or project, and any other dimensions relevant to your organization.
Keep retention periods appropriate for your needs. Prometheus can store metrics for extended periods, but this consumes disk space. For most GPU workloads, keeping detailed metrics for a week or two and downsampled metrics for longer periods strikes a good balance.
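With the kube-prometheus-stack chart, retention is set through Helm values. The snippet below is a sketch of that configuration; adjust the numbers to your own disk budget.

```yaml
# Sketch: Helm values for kube-prometheus-stack adjusting retention.
# Apply with: helm upgrade kube-prometheus-stack <chart> -n monitoring -f values.yaml
prometheus:
  prometheusSpec:
    retention: 14d         # keep detailed metrics for two weeks
    retentionSize: 20GB    # cap on-disk usage (whichever limit is hit first wins)
```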
The Case for Serverless GPU Infrastructure
Now that we’ve walked through setting up comprehensive GPU monitoring, let’s step back and consider what we’ve actually built. You’ve invested significant engineering time to get visibility into your GPU infrastructure, but monitoring is just the foundation. You still need to handle GPU driver updates, manage Kubernetes node pools, implement autoscaling policies, and coordinate hardware replacements when things fail. The monitoring system tells you about problems, but it doesn’t solve them.
The operational complexity compounds when you think about scalability. Your carefully configured monitoring will show you when GPUs are sitting idle at zero percent utilization, but you’re still paying for that capacity around the clock. During traffic spikes, you need to manually provision additional capacity, often with significant lead time. You end up either over-provisioning and wasting money, or under-provisioning and creating bottlenecks for your team.
This is where serverless GPU platforms like GPU-Shards fundamentally change the equation. Instead of building monitoring infrastructure and managing GPU pools, you deploy your model and the platform handles everything else. The monitoring, scaling, driver updates, and hardware management become abstracted away, freeing your team to focus on what actually differentiates your product. You get observability out of the box with metrics that matter for production inference serving.
Conclusion
If you want to learn, go the self-hosted path and implement GPU monitoring from the start. By understanding what your GPUs are actually doing, you can identify bottlenecks, right-size your infrastructure, and ensure you’re getting the most value from these expensive resources.
If you want to deploy scalable GPU inference endpoints quickly and securely, without carrying the burden of monitoring and optimization, consider platforms like GPU-Shards, which provide secure and scalable GPU infrastructure for cents.
