Memory Limits & Shards
How to size GPU memory shards and how the cap is enforced.
A shard is a fixed slice of GPU memory assigned to a container. The cap is set with the
CUDA_DEVICE_MEMORY_LIMIT environment variable and enforced by the preloaded
libvgpu library.
Setting a limit
The value is a number followed by a size suffix, m for megabytes or g for gigabytes:
# 4 GB
-e CUDA_DEVICE_MEMORY_LIMIT=4096m
# 1.5 GB
-e CUDA_DEVICE_MEMORY_LIMIT=1536m
# 12 GB
-e CUDA_DEVICE_MEMORY_LIMIT=12g
Inside the container, tools that query memory (including nvidia-smi run through the
interposer, and CUDA's own memory APIs) report the shard size, not the physical
card size.
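One way to confirm this from inside a container is to compare the configured limit with what nvidia-smi reports. The sketch below is a planning aid, not part of libvgpu: the helper names are hypothetical, it assumes the m and g suffixes mean MiB and GiB (1g = 1024m), and the 64 MiB tolerance is an arbitrary allowance for rounding.

```python
import os
import subprocess


def parse_limit_mib(value: str) -> int:
    """Convert a value such as '4096m' or '12g' to MiB (assumes 1g = 1024m)."""
    value = value.strip().lower()
    number, suffix = value[:-1], value[-1:]
    if suffix == "m":
        return int(number)
    if suffix == "g":
        return int(number) * 1024
    raise ValueError(f"unsupported size suffix in {value!r}")


def reported_total_mib() -> int:
    """Ask nvidia-smi for the total memory of the first visible GPU, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return int(out.splitlines()[0].strip())


def limit_matches_report(limit: str, reported_mib: int, tolerance_mib: int = 64) -> bool:
    """True when the reported total is within tolerance of the configured limit."""
    return abs(parse_limit_mib(limit) - reported_mib) <= tolerance_mib


def check_container() -> bool:
    """Compare this container's configured limit with what nvidia-smi reports."""
    limit = os.environ["CUDA_DEVICE_MEMORY_LIMIT"]
    return limit_matches_report(limit, reported_total_mib())
```

Calling check_container() inside a capped container should return True if the reported total reflects the shard rather than the physical card.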
Sizing guidance
- The sum of all active shards should stay at or below the card's physical memory.
- Leave headroom: the driver and the CUDA context consume a few hundred MB per container on their own.
- Inference workloads are usually predictable; size to the model plus a margin.
- Training is spikier; give it more headroom or it may hit the cap during peaks.
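The guidance above can be turned into a rough sizing calculation. This is a sketch under stated assumptions, not a rule from the product: the function names are hypothetical, the 300 MiB context overhead stands in for "a few hundred MB", and the 10% and 30% margins in the usage note are illustrative only.

```python
from typing import Iterable


def shard_size_mib(model_mib: int, margin_percent: int, context_overhead_mib: int = 300) -> int:
    """Model footprint plus context overhead, scaled up by a safety margin.

    Integer arithmetic with round-up avoids floating-point surprises.
    """
    base = model_mib + context_overhead_mib
    return (base * (100 + margin_percent) + 99) // 100


def fits_on_card(shards_mib: Iterable[int], card_mib: int) -> bool:
    """True when the shards sum to no more than the card's physical memory."""
    return sum(shards_mib) <= card_mib


# Assumed example: a 6000 MiB inference model and a 10000 MiB training job.
inference = shard_size_mib(6000, margin_percent=10)
training = shard_size_mib(10000, margin_percent=30)
```

With these numbers the two shards come to 6930 MiB and 13390 MiB, which fit together on a 24 GB (24576 MiB) card but not on a 16 GB one.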
What happens at the limit
When an allocation would push a container past its shard, the CUDA call fails with an
out-of-memory error, exactly as it would on a smaller physical GPU. Frameworks handle
it as they would any other out-of-memory condition (for example, PyTorch raises its
CUDA out of memory error). Other shards are unaffected.
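Because the failure looks like an ordinary out-of-memory error, application code can react to it the usual way. The sketch below retries a step with a smaller batch size. It is framework-agnostic and the function names are hypothetical. It assumes the error surfaces as a RuntimeError whose message contains "out of memory", which is how PyTorch's CUDA out-of-memory error has typically appeared; other frameworks may differ.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def is_cuda_oom(exc: BaseException) -> bool:
    """Heuristic: treat a RuntimeError mentioning 'out of memory' as an OOM."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def run_with_smaller_batches(
    step: Callable[[int], T],
    batch_size: int,
    min_batch_size: int = 1,
) -> Tuple[T, int]:
    """Call step(batch_size), halving the batch size after each OOM.

    Returns the step result and the batch size that succeeded. Any other
    error, or an OOM at the minimum batch size, is re-raised unchanged.
    """
    while True:
        try:
            return step(batch_size), batch_size
        except RuntimeError as exc:
            if not is_cuda_oom(exc) or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
```

In a PyTorch program, step would wrap one forward and backward pass. Releasing cached blocks with torch.cuda.empty_cache() before retrying may also help, though whether it is needed depends on the workload.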
Overcommitting
You can deploy shards that sum to more than the physical memory if you know they will not all peak at once. This trades the hard guarantee for density, so do it only when you understand the workloads. When in doubt, keep the total within the card.
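To reason about whether an overcommitted layout is safe, it can help to estimate the worst case when only some workloads peak together. The model below is an assumption made for illustration, not something the product computes: each workload sits at a typical usage you have measured and may spike up to its shard cap, and at most max_concurrent_peaks workloads spike at once. All names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    name: str
    cap_mib: int      # shard cap (the configured memory limit)
    typical_mib: int  # steady-state usage, measured by you


def overcommit_ratio(workloads: List[Workload], card_mib: int) -> float:
    """Sum of shard caps divided by physical memory; above 1.0 is overcommitted."""
    return sum(w.cap_mib for w in workloads) / card_mib


def worst_case_mib(workloads: List[Workload], max_concurrent_peaks: int) -> int:
    """Typical usage of all workloads plus the largest simultaneous spikes."""
    base = sum(w.typical_mib for w in workloads)
    spikes = sorted((w.cap_mib - w.typical_mib for w in workloads), reverse=True)
    return base + sum(spikes[:max_concurrent_peaks])


# Assumed example: three 12 GB shards on a 24 GB card, each idling at 6 GB.
jobs = [Workload(f"job-{i}", cap_mib=12288, typical_mib=6144) for i in range(3)]
card_mib = 24576
ratio = overcommit_ratio(jobs, card_mib)
one_peak = worst_case_mib(jobs, max_concurrent_peaks=1)
two_peaks = worst_case_mib(jobs, max_concurrent_peaks=2)
```

Here the layout is overcommitted by a factor of 1.5. One peak at a time needs exactly 24576 MiB and just fits, while two simultaneous peaks need 30720 MiB and would exhaust the card.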
Stuck? See Troubleshooting.