Memory Limits & Shards

How to size GPU memory shards and how the cap is enforced.

A shard is a fixed slice of GPU memory assigned to a container. The cap is set with the CUDA_DEVICE_MEMORY_LIMIT environment variable and enforced by the preloaded libvgpu library.

Setting a limit

The value is a number with a size suffix, for example m for megabytes or g for gigabytes:

# 4 GB
-e CUDA_DEVICE_MEMORY_LIMIT=4096m

# 1.5 GB
-e CUDA_DEVICE_MEMORY_LIMIT=1536m

# 12 GB
-e CUDA_DEVICE_MEMORY_LIMIT=12g
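
For fractional sizes it can help to compute the value rather than type it by hand. The sketch below is a hypothetical helper (limit_flag is not part of GPU Shards) and assumes that 1 GB corresponds to 1024m, as the 4096m and 1536m examples above imply.

```python
def limit_flag(gigabytes: float) -> str:
    """Build the -e flag for a shard of the given size.

    Assumption: 1 GB == 1024m, inferred from the examples (4 GB -> 4096m).
    Always emits the 'm' suffix so fractional sizes stay exact.
    """
    if gigabytes <= 0:
        raise ValueError("shard size must be positive")
    mebibytes = round(gigabytes * 1024)
    return f"-e CUDA_DEVICE_MEMORY_LIMIT={mebibytes}m"


print(limit_flag(1.5))  # -e CUDA_DEVICE_MEMORY_LIMIT=1536m
```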

Inside the container, tools that query memory (including nvidia-smi run through the interposer, and CUDA's own memory APIs) report the shard size, not the physical card size.
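
One way to confirm the cap from inside a container is to ask nvidia-smi for the total memory it sees. This is a minimal sketch: the function names are illustrative, and it assumes nvidia-smi is on the PATH in the container. The --query-gpu=memory.total and --format=csv,noheader,nounits options print one value in MiB per visible GPU.

```python
import subprocess


def parse_total_mib(output: str) -> list[int]:
    """Parse nvidia-smi output: one memory.total value (MiB) per non-empty line."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def visible_total_mib() -> list[int]:
    """Return total memory per visible GPU, in MiB, as reported by nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if nvidia-smi exits non-zero
    )
    return parse_total_mib(result.stdout)
```

Calling visible_total_mib() in a container started with a 4096m limit should, if reporting works as described above, return a value close to the shard size rather than the card size.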

Sizing guidance

  • The sum of all active shards should stay at or below the card's physical memory.
  • Leave headroom: the driver and the CUDA context themselves consume a few hundred MB per container.
  • Inference workloads are usually predictable; size the shard to the model plus a margin.
  • Training is spikier; give it more headroom, or it may hit the cap during peaks.
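
The guidance above can be sketched as a small sizing check. Everything here is illustrative: the 300 MiB overhead is an assumed stand-in for "a few hundred MB", the 20% margin is arbitrary, and counting the overhead against the shard itself is an assumption. Measure your own workloads before relying on these numbers.

```python
CONTEXT_OVERHEAD_MIB = 300  # assumed per-container driver/context cost


def inference_shard_mib(model_mib: int, margin_percent: int = 20,
                        overhead_mib: int = CONTEXT_OVERHEAD_MIB) -> int:
    """Model size plus a percentage margin (rounded up) plus fixed overhead."""
    with_margin = -(-model_mib * (100 + margin_percent) // 100)  # ceiling division
    return with_margin + overhead_mib


def fits_on_card(shards_mib: list[int], physical_mib: int) -> bool:
    """True if the shards sum to no more than the card's physical memory."""
    return sum(shards_mib) <= physical_mib


shard = inference_shard_mib(6000)           # 6000 MiB model -> 7500 MiB shard
print(shard, fits_on_card([shard, 4096], 12288))
```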

What happens at the limit

When an allocation would push a container past its shard, the CUDA call fails with an out-of-memory error, exactly as it would on a smaller physical GPU. The framework handles it the same way (for example, PyTorch raises a "CUDA out of memory" error). Other shards are unaffected.
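
Because the failure looks like an ordinary out-of-memory error, the usual application-level recovery patterns still apply. The sketch below shows one common pattern, retrying with a smaller batch; it is not something GPU Shards provides. run_with_backoff and its parameters are hypothetical names, and MemoryError is only a placeholder default. With a recent PyTorch release you would likely pass torch.cuda.OutOfMemoryError as oom_error.

```python
from typing import Callable, Type


def run_with_backoff(step: Callable[[int], object], batch_size: int,
                     oom_error: Type[BaseException] = MemoryError,
                     min_batch: int = 1):
    """Call step(batch_size), halving batch_size each time oom_error is raised.

    Returns (batch_size_that_worked, step_result). Re-raises the error if
    the step still fails at min_batch.
    """
    while True:
        try:
            return batch_size, step(batch_size)
        except oom_error:
            if batch_size <= min_batch:
                raise
            batch_size = max(min_batch, batch_size // 2)


def fake_step(n: int) -> int:
    """Stand-in for a GPU step that only fits when the batch is 8 or smaller."""
    if n > 8:
        raise MemoryError("simulated out of memory")
    return n * 2


print(run_with_backoff(fake_step, 64))  # (8, 16)
```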

Overcommitting

You can deploy shards that sum to more than the physical memory if you know they will not all peak at once. This trades the hard guarantee for density, so do it only when you understand the workloads. When in doubt, keep the total within the card.
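
To make the trade-off concrete, it can help to quantify how far a plan is overcommitted. This is a hedged sketch with hypothetical names and example numbers; the shortfall is the amount of memory that would be missing if every shard filled to its cap at the same time.

```python
def overcommit_summary(shards_mib: list[int], physical_mib: int) -> dict:
    """Summarise how the sum of shard caps compares with physical memory."""
    total = sum(shards_mib)
    return {
        "total_mib": total,
        "ratio": total / physical_mib,                  # > 1.0 means overcommitted
        "shortfall_mib": max(0, total - physical_mib),  # worst case, all peak at once
    }


# Example: three shards on a 16 GB (16384 MiB) card.
print(overcommit_summary([8192, 8192, 4096], 16384))
```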

Stuck? See Troubleshooting.
