NVIDIA GPUs on Kubernetes: How They Work Under the Hood

Author

Nikolay Penkov

Modern AI, ML, and compute workloads rely heavily on GPU acceleration. While NVIDIA makes GPU support in Docker relatively straightforward, configuring GPUs with containerd (the default runtime in Kubernetes and Minikube) requires a few careful steps.

This guide helps you understand how GPUs are handled under the hood in Kubernetes and walks you through installing NVIDIA drivers, configuring the NVIDIA Container Toolkit, and integrating it with containerd.

The big picture

The Kubernetes device plugin advertises GPUs as extended resources (such as nvidia.com/gpu, or nvidia.com/mig-1g.5gb for MIG). The kubelet updates the node's capacity, and the scheduler places pods on nodes that satisfy their requests. At container start time, the kubelet calls the device plugin's Allocate method and then hands the container runtime a ready-to-run spec.

On the node, you need the NVIDIA kernel driver and the NVIDIA Container Toolkit. The toolkit provides nvidia-container-runtime and the hooks/CLI (libnvidia-container) that inject device nodes, libraries, and env into the container.

Newer clusters commonly rely on CDI (the Container Device Interface), which lets the runtime grant GPU access using standard device descriptions. With NVIDIA's CDI support, this reduces the need for a special runtime class.
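As a quick sketch, the toolkit's nvidia-ctk CLI can generate and list the CDI device descriptions on a node (the output path below is the common convention; adjust it for your setup):

```shell
# Generate a CDI spec describing the GPUs on this node
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the CDI device names the runtime can now reference
nvidia-ctk cdi list
```

CDI-aware runtimes resolve names like nvidia.com/gpu=0 from this spec instead of relying on runtime-specific injection hooks.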

Many teams install everything via the NVIDIA GPU Operator, which automates drivers, the toolkit, the device plugin, GPU Feature Discovery (node labels), the DCGM exporter, and an optional MIG manager. You can check our blog post that goes into detail on how to set up the GPU Operator on Minikube.
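For reference, a minimal Operator install via Helm looks roughly like this (repo and chart names as published by NVIDIA; release and namespace names are our choices):

```shell
# Add NVIDIA's Helm repository and install the GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install --wait gpu-operator \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
```

The Operator then deploys the driver, toolkit, and device plugin as DaemonSets on GPU nodes, so the manual steps below become unnecessary on Operator-managed clusters.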

What gets exposed to pods?

Resource model seen by Kubernetes

Extended resources on the node:

  • Whole GPUs: nvidia.com/gpu
  • MIG slices (A100/H100, etc.): nvidia.com/mig-1g.5gb, nvidia.com/mig-2g.10gb, … (depends on how the GPU is partitioned).

You request/limit these the same way you request CPUs/memory. The scheduler only places the pod where capacity exists.
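For example, a pod requesting one whole GPU can be sketched like this (the CUDA image tag is an illustrative choice, not a requirement):

```shell
# Apply a minimal pod spec that requests one GPU as an extended resource
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # for extended resources, the limit also sets the request
EOF

kubectl logs gpu-test
```

Note that extended resources cannot be overcommitted: a GPU counted against one container is unavailable to others until the pod terminates.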

Inside the container

Character devices: /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm, etc.

CUDA/NVML libraries mounted in, plus env like NVIDIA_VISIBLE_DEVICES (older path) or CDI device references (newer path). These are wired in by the NVIDIA container toolkit (runtime/hook/CLI) or by CDI-aware runtimes.
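You can inspect what was wired in from inside a running GPU pod. Assuming a pod named gpu-test (a hypothetical name) that is still running:

```shell
# Device nodes injected by the toolkit or CDI
kubectl exec gpu-test -- ls -l /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm

# Environment variables, e.g. NVIDIA_VISIBLE_DEVICES on the older path
kubectl exec gpu-test -- env | grep -i nvidia

# Driver-matched user-space libraries mounted into the container
kubectl exec gpu-test -- ldconfig -p | grep -E 'libcuda|libnvidia-ml'
```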

Install NVIDIA GPU drivers

Before installing anything, make sure Ubuntu recognizes your GPU and see which driver packages are available with the following command:

sudo ubuntu-drivers list

You will see something like:

nvidia-driver-470
nvidia-driver-470-server
nvidia-driver-535
nvidia-driver-535-open
nvidia-driver-535-server
nvidia-driver-535-server-open
nvidia-driver-550
nvidia-driver-550-open
nvidia-driver-550-server
nvidia-driver-550-server-open

Pick a version that is valid for your system and install the corresponding NVIDIA driver. We'll let Ubuntu handle the selection automatically with the following command:

sudo ubuntu-drivers install

Verify GPU setup

After the installation finishes, reboot your system and check that the NVIDIA driver is working correctly:

nvidia-smi

This command reports GPU usage statistics like temperature, memory, and power consumption, and it can also be used to control GPU settings such as power limits and compute modes. If you see output like the following, the installation was successful:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:26:00.0  On |                  N/A |
|  0%   44C    P8             13W /  170W |       9MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Verify Kernel Modules and Driver Files

NVIDIA kernel modules are drivers that connect the Linux kernel to NVIDIA GPU hardware, enabling graphics rendering and GPU computation. The main module nvidia manages core GPU operations and communication with user-space tools. Supporting modules like nvidia_modeset and nvidia_drm handle display configuration and integrate with the Linux graphics stack, while nvidia_uvm provides unified memory access for CUDA workloads. Together, they ensure your system can fully utilize the GPU for both display and compute tasks.

Check loaded NVIDIA kernel modules

After installing the NVIDIA drivers, it’s important to verify that the kernel modules have been correctly loaded. These modules form the critical link between the Linux kernel and your GPU hardware — without them, the GPU won’t be accessible to container runtimes or CUDA applications.

To check that the modules are loaded, run:

lsmod | grep nvidia

If the kernel modules are loaded correctly, you should see something like:

nvidia_uvm            2179072  28
nvidia_drm             139264  0
nvidia_modeset        1814528  1 nvidia_drm
nvidia               14381056  36 nvidia_uvm,nvidia_modeset

Check driver version

Verifying the installed driver version ensures compatibility with your GPU hardware and the CUDA or container runtime you plan to use. To check the current NVIDIA driver version, run:

cat /proc/driver/nvidia/version

This command displays details about the loaded driver, including its version number and build information. Confirm that it matches the version you intended to install; mismatched or outdated drivers can cause issues with GPU detection or containerized workloads. A properly configured system produces output like this:

NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 580.95.05 Release Build (dvs-builder@U22-I3-B17-02-5) Tue Sep 23 09:55:41 UTC 2025
GCC version: gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)

Install the Container Runtime CLI (crictl)

The Container Runtime CLI is a lightweight command-line tool for interacting directly with container runtimes such as containerd or CRI-O. It’s especially useful for debugging Kubernetes nodes, checking container statuses, and inspecting images when kubectl alone isn’t sufficient. You can think of it as the equivalent of docker for low-level container runtimes.

While docker interacts with the Docker Engine (which bundles its own runtime and tooling), crictl communicates directly with CRI-compatible runtimes like containerd or CRI-O — the same runtimes Kubernetes uses under the hood.

Let's install crictl v1.34.0 with:

VERSION="v1.34.0"
wget https://github.com/kubernetes-sigs/cri-tools/releases/download/$VERSION/crictl-$VERSION-linux-amd64.tar.gz
sudo tar zxvf crictl-$VERSION-linux-amd64.tar.gz -C /usr/local/bin
rm -f crictl-$VERSION-linux-amd64.tar.gz

Verify it with:

sudo crictl info | grep runtimeType

You’ll get output showing which container runtime crictl is connected to:

# On a system using containerd, you’ll typically see:
"runtimeType": "io.containerd.runc.v2"

# If you’re using a different runtime like CRI-O, it might show something like:
"runtimeType": "cri-o"
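Beyond checking the runtime, crictl is handy for day-to-day node debugging. A few read-only commands (sudo is needed when the runtime socket is root-owned; the container ID below is a placeholder):

```shell
sudo crictl pods          # list pod sandboxes on this node
sudo crictl ps -a         # list all containers, including exited ones
sudo crictl images        # list images pulled by the runtime
sudo crictl logs <container-id>   # fetch a container's logs by ID
```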

Install the NVIDIA Container Toolkit

The NVIDIA Container Toolkit provides the runtime wrapper, hooks, and libraries that make GPUs visible inside containers.

Follow NVIDIA’s official instructions:

# Add NVIDIA repo and key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Update and install
sudo apt update
sudo apt install -y nvidia-container-toolkit

Configure containerd to Use the NVIDIA Runtime

By default, containerd starts containers with a generic runtime that doesn’t expose GPUs. NVIDIA GPU support requires a special runtime layer (from the NVIDIA Container Toolkit) that:

- Mounts GPU device nodes (e.g., /dev/nvidia0, /dev/nvidia-uvm) into containers.

- Injects NVIDIA user-space libraries (libcuda, libnvidia-ml, etc.) and driver-matched components.

- Applies the right OCI hooks and cgroup settings so CUDA and drivers work reliably and securely inside containers.

# Configure containerd to recognize the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=containerd

# Restart containerd
sudo systemctl restart containerd
sudo systemctl status containerd

# (Optional) Make NVIDIA the default runtime
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default

Lastly, we can check that the NVIDIA runtime was added by running the command below:

sudo cat /etc/containerd/config.toml | grep "containerd.runtimes.nvidia"

Expected output:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-cdi]
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-cdi.options]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-legacy]
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-legacy.options]
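As a final smoke test, you can run a GPU container directly against containerd with ctr, bypassing Kubernetes entirely. This is a sketch: the CUDA image tag is an example (any CUDA base image works), and the --gpus flag requires a ctr build with NVIDIA GPU support:

```shell
# Pull a CUDA base image and run nvidia-smi inside it with GPU 0 attached
sudo ctr image pull docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04
sudo ctr run --rm --gpus 0 \
  docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04 gpu-smoke nvidia-smi
```

If this prints the same nvidia-smi table as on the host, the runtime wiring is correct and any remaining issues are on the Kubernetes side (device plugin, scheduling, or resource requests).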


Reference

* [NVIDIA Container Toolkit Docs](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)

* [NVIDIA GPU Operator GitHub](https://github.com/NVIDIA/gpu-operator)

* [NVIDIA CUDA Docker Hub](https://hub.docker.com/r/nvidia/cuda/tags)

* [Enabling GPUs in the Container Runtime Ecosystem (NVIDIA Blog)](https://developer.nvidia.com/blog/gpu-containers-runtime/)

* [Kubernetes Device Plugin Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/)


