Long-running inference processes can leak VRAM. The model loads, serves for days, and memory creeps up until OOM. On our dedicated GPU hosting you have the tools to detect and diagnose these leaks without waiting for OOM.
Detection
Plot VRAM used over time in Grafana. Healthy serving shows a fixed baseline (model + allocated KV cache pool) with no trend. A leak shows a slow upward slope over hours or days.
DCGM_FI_DEV_FB_USED{gpu="0"}
Set an alert on the trend: if VRAM used grows by more than X MB over 24 hours, fire.
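Before wiring the alert into Grafana, you can prototype the threshold offline. A minimal sketch, assuming you have scraped DCGM_FI_DEV_FB_USED into (timestamp, MiB) pairs — the function name and thresholds are illustrative:

```python
def vram_trend_mib_per_day(samples):
    """Least-squares slope of VRAM usage, in MiB per day.

    samples: list of (unix_timestamp_seconds, used_mib) pairs,
    e.g. scraped from DCGM_FI_DEV_FB_USED.
    """
    n = len(samples)
    ts = [t for t, _ in samples]
    ys = [y for _, y in samples]
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in samples)
    den = sum((t - t_mean) ** 2 for t in ts)
    return (num / den) * 86400  # per-second slope scaled to per-day


# Flat baseline: no trend. A steady 10 MiB/hour creep: 240 MiB/day.
flat = [(h * 3600, 20000) for h in range(24)]
leaky = [(h * 3600, 20000 + 10 * h) for h in range(24)]
assert abs(vram_trend_mib_per_day(flat)) < 1
assert vram_trend_mib_per_day(leaky) > 200
```

A least-squares slope over a 24-hour window is more robust than comparing two point samples, since batch-size variation makes instantaneous usage noisy.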
Common Causes
- CUDA tensors kept alive by lingering Python references (the GC cannot free what is still referenced)
- Growing prefix cache without eviction policy
- Model-specific bug in the serving framework version
- Driver memory fragmentation
- Long-lived client connections accumulating state
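The prefix-cache cause is the most common self-inflicted one: each distinct prefix pins KV blocks, and without eviction the pool only grows. A minimal sketch of a bounded cache with LRU eviction — the class and method names are illustrative, not any framework's API:

```python
from collections import OrderedDict


class BoundedPrefixCache:
    """Prefix cache with LRU eviction. Without the max_entries bound,
    every distinct prefix pins its KV blocks forever and VRAM only grows."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()  # prefix -> cached KV blocks

    def get(self, prefix):
        if prefix in self._entries:
            self._entries.move_to_end(prefix)  # mark most recently used
            return self._entries[prefix]
        return None

    def put(self, prefix, kv_blocks):
        self._entries[prefix] = kv_blocks
        self._entries.move_to_end(prefix)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used


cache = BoundedPrefixCache(max_entries=2)
cache.put("sys-prompt-a", "blocks-a")
cache.put("sys-prompt-b", "blocks-b")
cache.get("sys-prompt-a")                   # touch a; b becomes LRU
cache.put("sys-prompt-c", "blocks-c")       # forces eviction of b
assert cache.get("sys-prompt-b") is None
assert cache.get("sys-prompt-a") == "blocks-a"
```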
Diagnose
PyTorch memory snapshot (for custom inference code):
import torch
torch.cuda.memory._record_memory_history(max_entries=100000)
# ... run workload ...
torch.cuda.memory._dump_snapshot("snapshot.pickle")
Visualise the snapshot with PyTorch's memory visualiser (pytorch.org/memory_viz); it shows allocations over time with the stack traces that made them.
For vLLM or TGI, check the project issue tracker for known leaks in your version. Updating to the latest release fixes many reported leaks.
Mitigate
Short-term:
- Periodic restarts via systemd timer (once per day or per week)
- Tighten --gpu-memory-utilization to leave more headroom
- Reduce prefix cache size
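The periodic-restart mitigation can be expressed as a systemd timer plus a oneshot service. A sketch, assuming your inference service runs as a unit named vllm.service — all unit names here are hypothetical:

```ini
# /etc/systemd/system/vllm-restart.service (hypothetical unit names)
[Unit]
Description=Restart the inference service to reclaim leaked VRAM

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl restart vllm.service

# /etc/systemd/system/vllm-restart.timer
[Unit]
Description=Daily restart of the inference service

[Timer]
OnCalendar=daily
RandomizedDelaySec=1h
Persistent=true

[Install]
WantedBy=timers.target
```

Enable with `systemctl enable --now vllm-restart.timer`. RandomizedDelaySec spreads restarts so multiple GPUs on one host do not drop capacity simultaneously.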
Long-term: fix the leak. File an issue, capture a memory snapshot, verify on the latest framework version.
Monitored GPU Hosting
DCGM-instrumented UK dedicated servers with VRAM trend alerting.
Browse GPU Servers
See DCGM Exporter and structured logging.