Memory Limits & Shards

How to size GPU memory shards and how the cap is enforced.

A shard is a fixed slice of GPU memory assigned to a container. The cap is set with the CUDA_DEVICE_MEMORY_LIMIT environment variable and enforced by the preloaded libvgpu library.

Setting a limit

The value accepts a size suffix:

# 4 GB
-e CUDA_DEVICE_MEMORY_LIMIT=4096m

# 1.5 GB
-e CUDA_DEVICE_MEMORY_LIMIT=1536m

# 12 GB
-e CUDA_DEVICE_MEMORY_LIMIT=12g

Inside the container, tools that query memory (including nvidia-smi run through the interposer, and CUDA's own memory APIs) report the shard size, not the physical card size.
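To confirm that a limit took effect, one option is to normalize the configured value and compare it with what a memory query reports inside the container. The following is a minimal POSIX shell sketch, not part of the product: `limit_to_mib` and `matches_shard` are hypothetical helper names, and it assumes that `g` means 1024 `m`, consistent with the 4096m = 4 GB example above. Whether the reported figure matches the limit exactly is worth verifying on your own setup.

```shell
#!/bin/sh
# Hypothetical helper: convert a limit such as "4096m" or "12g" to MiB.
# Assumption: "g" is 1024 "m". Values without a suffix are rejected because
# this page does not say how they are interpreted.
limit_to_mib() {
    value=${1%[mg]}    # strip a trailing m or g, if present
    case "$value" in
        ''|*[!0-9]*) echo "unsupported limit: $1" >&2; return 1 ;;
    esac
    case "$1" in
        *m) echo "$value" ;;
        *g) echo $(( value * 1024 )) ;;
        *)  echo "unsupported limit: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: succeed when the reported total equals the limit.
# $1: limit string (e.g. "$CUDA_DEVICE_MEMORY_LIMIT")
# $2: total memory in MiB as reported inside the container
matches_shard() {
    expected=$(limit_to_mib "$1") || return 2
    [ "$expected" -eq "$2" ]
}
```

Inside a container, the second argument could come from `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`, which prints the total in MiB, one line per visible GPU.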

Sizing guidance

The sum of all active shards should stay at or below the card's physical memory.
Leave headroom: the driver and the CUDA context consume a few hundred MB per container on their own.
Inference workloads are usually predictable; size to the model plus a margin.
Training is spikier; give it more headroom or it may hit the cap during peaks.
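The guidance above amounts to simple arithmetic: model footprint, plus a margin, plus per-container overhead. A rough sketch follows, assuming you already know the model's footprint in MiB. The helper name `suggest_shard`, the margin percentages, and the 500 MiB overhead are illustrative assumptions, not recommended values.

```shell
#!/bin/sh
# Hypothetical helper: suggest a value for CUDA_DEVICE_MEMORY_LIMIT.
# $1: model footprint in MiB
# $2: margin in percent (larger for spiky workloads such as training)
# $3: per-container driver and CUDA context overhead in MiB
suggest_shard() {
    model_mib=$1
    margin_pct=$2
    overhead_mib=$3
    total=$(( model_mib + model_mib * margin_pct / 100 + overhead_mib ))
    echo "${total}m"
}

# Illustrative numbers only:
#   suggest_shard 6000 20 500   prints 7700m  (inference, modest margin)
#   suggest_shard 6000 50 500   prints 9500m  (training, wider margin)
```

Measure real peak usage before settling on a margin; the percentages here only show the shape of the calculation.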

When an allocation would push a container past its shard, the CUDA call fails with an out-of-memory error, exactly as it would on a smaller physical GPU. The framework handles it the same way; PyTorch, for example, raises its usual "CUDA out of memory" error. Other shards are unaffected.

Overcommitting

You can deploy shards that sum to more than the physical memory if you know they will not all peak at once. This trades the hard guarantee for density, so do it only when you understand the workloads. When in doubt, keep the total within the card.
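Before deploying, it can help to make the commit level explicit. The sketch below sums a set of planned shard sizes and reports whether they fit within the card. `commit_report` is a hypothetical helper, all sizes are assumed to be in MiB, and the 24576 MiB card in the comments is only an example.

```shell
#!/bin/sh
# Hypothetical helper: report how much of a card a set of shards commits.
# $1: physical memory in MiB; remaining arguments: shard sizes in MiB.
# Returns 0 when the shards fit within the card, 1 when they overcommit it.
commit_report() {
    physical_mib=$1
    shift
    total=0
    for shard in "$@"; do
        total=$(( total + shard ))
    done
    pct=$(( total * 100 / physical_mib ))    # integer percent, rounded down
    if [ "$total" -le "$physical_mib" ]; then
        echo "within card: ${total} of ${physical_mib} MiB (${pct}%)"
    else
        echo "overcommitted: ${total} of ${physical_mib} MiB (${pct}%)"
        return 1
    fi
}

# Example with a 24576 MiB card:
#   commit_report 24576 8192 8192 8192
#     prints: within card: 24576 of 24576 MiB (100%)
#   commit_report 24576 12288 8192 8192
#     prints: overcommitted: 28672 of 24576 MiB (116%)
```

A non-zero exit status makes the check easy to wire into a deployment script as a gate or a warning.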

Stuck? See Troubleshooting.
