How a GPU Memory Leak Manifests
Your GPU server starts fine. Inference works. But over hours — or sometimes minutes — VRAM consumption climbs steadily until you hit:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB
(GPU 0; 23.65 GiB total capacity; 22.89 GiB already allocated)
The telltale sign is that VRAM was not this full at startup. Something is allocating GPU memory and never releasing it. This is distinct from a model that simply requires more VRAM than available — a leak grows over time, proportional to the number of requests or training steps.
Common Sources of VRAM Leaks
On PyTorch GPU servers, memory leaks almost always come from Python-side issues, not from CUDA itself:
- Accumulating tensors in a list. Appending model outputs to a Python list without detaching them keeps the entire computation graph alive.
- Gradients not disabled during inference. Running forward passes without torch.no_grad() creates gradient tensors that persist.
- Global variables holding references. A logging dictionary or metrics tracker that stores tensors on GPU.
- Hooked callbacks retaining activations. PyTorch hooks capture intermediate tensors and can prevent garbage collection.
- DataLoader workers accumulating state. Persistent workers can hold onto GPU tensors across batches.
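The first pattern above is easy to reproduce. A minimal sketch, using a stand-in torch.nn.Linear as the model, shows how stored outputs keep requires_grad set, which keeps the whole autograd graph (and every intermediate activation) alive:

```python
import torch

model = torch.nn.Linear(128, 128)  # stand-in model for illustration
x = torch.randn(4, 128)

# Leaky pattern: each stored output carries its autograd graph,
# pinning every intermediate activation in memory.
leaky = [model(x) for _ in range(3)]
print(leaky[0].requires_grad)  # True

# Safe pattern: detach() severs the graph before storing.
safe = [model(x).detach() for _ in range(3)]
print(safe[0].requires_grad)  # False
```

On a GPU, each element of leaky also pins VRAM for its intermediate buffers; the detached versions do not.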
Detecting the Leak
Use this monitoring snippet to track VRAM allocation over time:
import torch

def report_gpu_memory(tag=""):
    # memory_allocated counts live tensors; memory_reserved counts
    # blocks held by PyTorch's caching allocator
    allocated = torch.cuda.memory_allocated() / 1e6
    reserved = torch.cuda.memory_reserved() / 1e6
    print(f"[{tag}] Allocated: {allocated:.1f} MB | Reserved: {reserved:.1f} MB")

# Call before and after each request/batch
report_gpu_memory("before_request")
# ... process request ...
report_gpu_memory("after_request")
If “Allocated” grows after each call and never drops, you have a leak. For production monitoring, integrate this with your GPU monitoring setup to catch leaks before they cause downtime.
Finding the specific leaking tensors
import torch
import gc

gc.collect()
torch.cuda.empty_cache()

# List all tensors currently on GPU
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) and obj.is_cuda:
            print(f"{type(obj).__name__}: {obj.shape} {obj.dtype} {obj.device}")
    except Exception:
        # Some objects tracked by gc raise on attribute access; skip them
        pass
This prints every GPU tensor that Python’s garbage collector knows about. Run it after a few inference calls — the tensors that keep growing in count are your leak.
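Raw output from that loop gets noisy fast. One way to make it scannable is to tally tensors by shape and dtype and diff two snapshots; the gpu_tensor_census helper below is a sketch built on the same gc.get_objects() loop (the name is ours, not a PyTorch API), and it returns an empty tally on CPU-only hosts:

```python
import gc
from collections import Counter

import torch

def gpu_tensor_census():
    """Tally live GPU tensors by (shape, dtype). Empty on CPU-only hosts."""
    counts = Counter()
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                counts[(tuple(obj.shape), str(obj.dtype))] += 1
        except Exception:
            continue  # some tracked objects raise on attribute access
    return counts

# Compare snapshots between requests: a shape whose count only climbs is the leak.
before = gpu_tensor_census()
# ... process a request ...
after = gpu_tensor_census()
for key in after:
    if after[key] > before.get(key, 0):
        print(f"growing: {key} ({before.get(key, 0)} -> {after[key]})")
```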
Fixing the Most Common Leaks
Fix 1: Wrap inference in torch.no_grad()
@torch.no_grad()
def run_inference(model, inputs):
    return model(inputs)
This single decorator prevents gradient computation, which is the number one cause of memory growth during inference on GPU servers.
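On PyTorch 1.9 and later, torch.inference_mode() is a drop-in, slightly stricter alternative: like no_grad() it disables gradient tracking, and it additionally skips autograd's version-counter bookkeeping. A minimal sketch with a stand-in model:

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in model for illustration

@torch.inference_mode()  # stricter than no_grad(); tensors created
def run_inference(model, inputs):  # inside cannot re-enter autograd later
    return model(inputs)

out = run_inference(model, torch.randn(2, 16))
print(out.requires_grad)  # False: no graph is built, so nothing accumulates
```

The one caveat is that tensors created under inference_mode cannot later be used in operations that record gradients; for a pure inference server that restriction is usually free.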
Fix 2: Detach before collecting results
# BAD: keeps computation graph
results.append(model(x))
# GOOD: detach from graph and move to CPU
results.append(model(x).detach().cpu())
Fix 3: Delete references and clear cache
output = model(input_tensor)
result = output.detach().cpu().numpy()
del output, input_tensor
torch.cuda.empty_cache()
Fix 4: Use context managers for temporary tensors
with torch.cuda.device(0):
    temp = torch.randn(1000, 1000, device='cuda')
    result = some_operation(temp)
    del temp
torch.cuda.empty_cache()
Production Leak Prevention
For long-running inference services on vLLM, Ollama, or custom PyTorch servers:
- Set a VRAM usage threshold alert. If usage exceeds 90 percent for more than five minutes, trigger a restart.
- Implement periodic gc.collect() and torch.cuda.empty_cache() between requests.
- For Flask or FastAPI servers, avoid storing any tensor references as global state.
- Profile memory during load testing before going live — our tutorials cover load testing inference endpoints.
- Use Docker containers with resource limits as a safety net — the container gets killed and restarted if it exceeds its VRAM budget.
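The per-request cleanup from the list above can live in a single decorator instead of being sprinkled through every handler. A sketch, assuming a synchronous handler (the reclaim_vram name is ours; the CUDA call is skipped on CPU-only hosts):

```python
import functools
import gc

import torch

def reclaim_vram(handler):
    """Run GC and release cached CUDA blocks after each request."""
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        try:
            return handler(*args, **kwargs)
        finally:
            gc.collect()                  # drop unreachable Python-side tensor refs
            if torch.cuda.is_available():
                torch.cuda.empty_cache()  # return cached blocks to the driver
    return wrapper

@reclaim_vram
def handle_request(x):
    return x * 2  # stand-in for a real inference handler

print(handle_request(21))  # 42
```

Running the cleanup in a finally block means it fires even when the handler raises, which is exactly when half-built tensors tend to get stranded.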
Automated VRAM Monitoring Script
#!/bin/bash
# vram_monitor.sh — alert if VRAM usage exceeds threshold

THRESHOLD=90

while true; do
  # -i 0 queries GPU 0 only; loop over indices on multi-GPU hosts
  USAGE=$(nvidia-smi -i 0 --query-gpu=memory.used,memory.total \
            --format=csv,noheader,nounits | \
          awk -F', ' '{printf "%.0f", ($1/$2)*100}')
  if [ "$USAGE" -gt "$THRESHOLD" ]; then
    echo "WARNING: VRAM at ${USAGE}% — possible leak"
  fi
  sleep 30
done
Integrate this with your alerting system as part of a comprehensive GPU monitoring strategy.
Reliable GPU Servers for Production AI
GigaGPU dedicated servers with generous VRAM give your workloads room to breathe. Pair with proper monitoring for zero-downtime operation.
Browse GPU Servers