How GPU Sharing Works
The mechanism behind isolated GPU memory shards.
GPU Shards does not virtualize the GPU at the driver level. Instead, it intercepts the CUDA calls a container makes and enforces a memory budget on them. This keeps the host driver untouched and lets you run stock CUDA images.
The libvgpu interposer
Each sharded container starts with libvgpu.so preloaded via LD_PRELOAD. This library, from Project-HAMi, sits between the application and the real CUDA driver:
your app → libvgpu.so → NVIDIA driver → GPU
           │
           └─ enforces CUDA_DEVICE_MEMORY_LIMIT
When the app allocates memory, libvgpu checks the request against the configured limit. Calls that would exceed the shard fail just as they would on a smaller physical card, so frameworks behave normally; they simply see less memory.
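The check described above can be pictured with a small toy model. This is only a sketch of the bookkeeping idea, not the actual libvgpu code: the `ShardBudget` class, its method names, and the `ShardOutOfMemory` exception are hypothetical names introduced for illustration.

```python
class ShardOutOfMemory(MemoryError):
    """Stands in for the CUDA out-of-memory error the app would see."""


class ShardBudget:
    """Toy model of a per-container memory budget (not the real libvgpu code)."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def alloc(self, size_bytes: int) -> None:
        # Refuse the request if it would push usage past the budget.
        if self.used_bytes + size_bytes > self.limit_bytes:
            raise ShardOutOfMemory(
                f"requested {size_bytes} B with {self.used_bytes} B in use; "
                f"limit is {self.limit_bytes} B"
            )
        self.used_bytes += size_bytes

    def free(self, size_bytes: int) -> None:
        # Return memory to the budget, never dropping below zero.
        self.used_bytes = max(0, self.used_bytes - size_bytes)


GIB = 1024**3
budget = ShardBudget(limit_bytes=12 * GIB)
budget.alloc(8 * GIB)  # fits inside the 12 GiB budget
try:
    budget.alloc(6 * GIB)  # 8 + 6 > 12, so this request is refused
except ShardOutOfMemory as exc:
    print("allocation refused:", exc)
```

From the application's point of view the refusal looks like an ordinary out-of-memory condition, which is why frameworks need no special handling.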
Why no Kubernetes
Most GPU-partitioning setups rely on Kubernetes device plugins and MIG. GPU Shards targets a single host: the backend talks directly to the Docker daemon, sets the right environment variables, and starts the container. That removes a large operational surface for teams that just want to share one box.
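As a rough illustration of the single-host flow, the sketch below assembles a `docker run` command that sets the two environment variables named on this page. It uses the Docker CLI for readability; the actual backend may use the Docker API instead. The library paths, the image tag, the `12g` value format, and the `build_docker_command` helper are all assumptions made for the example.

```python
import shlex

# Hypothetical in-container path for the interposer; adjust to your setup.
VGPU_LIB = "/usr/local/vgpu/libvgpu.so"


def build_docker_command(image: str, memory_limit: str, host_lib: str) -> list[str]:
    """Return a `docker run` argument list for one memory-capped container."""
    return [
        "docker", "run", "--rm",
        "--gpus", "all",                     # expose the GPU to the container
        "-v", f"{host_lib}:{VGPU_LIB}:ro",   # mount the interposer read-only
        "-e", f"LD_PRELOAD={VGPU_LIB}",      # preload it into every process
        "-e", f"CUDA_DEVICE_MEMORY_LIMIT={memory_limit}",  # budget to enforce
        image,
    ]


cmd = build_docker_command(
    image="nvidia/cuda:12.4.1-base-ubuntu22.04",  # example stock CUDA image
    memory_limit="12g",                            # value format is an assumption
    host_lib="/opt/vgpu/libvgpu.so",               # hypothetical host path
)
print(shlex.join(cmd))
```

The function only builds the argument list; passing it to `subprocess.run(cmd, check=True)` would start the container on a host with Docker and the NVIDIA container runtime installed.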
What isolation does and does not cover
- Memory — Hard-capped per container. This is the primary isolation guarantee.
- Compute — Shared cooperatively; a container can use spare SM cycles when others are idle.
- Faults — A crash inside one container does not affect the host driver or other shards.
- Hardware — Not isolated. Every shard runs on the same physical GPU through the same driver. GPU Shards isolates GPU memory, not the silicon — it is not hardware-level partitioning like NVIDIA MIG.
Compared to GPU Operator time-slicing
The usual way to share a GPU on Kubernetes is the NVIDIA GPU Operator with time-slicing. That carves up compute time between replicas, but it does not partition memory: every container scheduled onto a time-sliced GPU sees the card's full memory, and nothing stops one of them from allocating all of it.
Example — two containers on a 24 GB card:
- Time-slicing: both containers see all 24 GB. If the first one leaks or spikes to 20 GB, the second one's next allocation fails with a CUDA out-of-memory error and the process crashes — even though it "should" have had its own half. There is no per-container budget protecting it.
- GPU Shards: each container gets a hard 12 GB cap. The first container's allocation fails as soon as it would exceed its own 12 GB budget, and the second container keeps running with its 12 GB untouched.
So you get predictable per-tenant memory and noisy-neighbor protection without MIG-capable hardware, a Kubernetes cluster, or driver changes. The trade-off is that compute is still shared cooperatively and the underlying card is not hardware-isolated — if you need that, MIG is the right tool.
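The two-container example above can be replayed as a toy simulation. It models only the memory accounting, under the assumption that container A requests 20 GB first and container B then requests 8 GB; the function names are hypothetical and no GPU is involved.

```python
CARD_GB = 24  # total memory on the card, in GB


def time_slicing(requests):
    """Shared pool: every container draws from the card's full memory."""
    used, outcome = 0, []
    for name, size in requests:
        ok = used + size <= CARD_GB
        if ok:
            used += size
        outcome.append((name, size, ok))
    return outcome


def sharded(requests, caps):
    """Per-container caps: each request is checked against its own budget."""
    used = {name: 0 for name in caps}
    outcome = []
    for name, size in requests:
        ok = used[name] + size <= caps[name]
        if ok:
            used[name] += size
        outcome.append((name, size, ok))
    return outcome


# Container A spikes to 20 GB, then container B asks for 8 GB.
requests = [("A", 20), ("B", 8)]

print(time_slicing(requests))
# [('A', 20, True), ('B', 8, False)]  -> B is the one that fails
print(sharded(requests, caps={"A": 12, "B": 12}))
# [('A', 20, False), ('B', 8, True)]  -> A fails inside its own budget
```

In the shared pool the well-behaved container pays for its neighbor's spike; with per-container caps the failure stays with the container that caused it.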
For details on choosing limits, see Memory Limits & Shards.