
GPU Memory Leak: Detecting and Fixing VRAM Leaks

Detect and fix GPU memory leaks that cause VRAM usage to grow over time. Covers PyTorch reference leaks, gradient accumulation bugs, and monitoring strategies for production GPU servers.

How a GPU Memory Leak Manifests

Your GPU server starts fine. Inference works. But over hours — or sometimes minutes — VRAM consumption climbs steadily until you hit:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB
(GPU 0; 23.65 GiB total capacity; 22.89 GiB already allocated)

The telltale sign is that VRAM was not this full at startup. Something is allocating GPU memory and never releasing it. This is distinct from a model that simply requires more VRAM than available — a leak grows over time, proportional to the number of requests or training steps.

Common Sources of VRAM Leaks

On PyTorch GPU servers, memory leaks almost always come from Python-side issues, not from CUDA itself:

  • Accumulating tensors in a list. Appending model outputs to a Python list without detaching them keeps the entire computation graph alive.
  • Gradients not disabled during inference. Running forward passes without torch.no_grad() creates gradient tensors that persist.
  • Global variables holding references. A logging dictionary or metrics tracker that stores tensors on GPU.
  • Hooked callbacks retaining activations. PyTorch hooks capture intermediate tensors and can prevent garbage collection.
  • DataLoader workers accumulating state. Persistent workers can hold onto GPU tensors across batches.
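The first of these is the easiest to reproduce. A minimal sketch of the leak and its fix, using a hypothetical model for illustration (written device-agnostically so it also runs on CPU):

```python
import torch

# Device-agnostic so the example also runs without a GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(64, 64).to(device)
x = torch.randn(8, 64, device=device)

# BAD: each appended output still carries its autograd graph,
# so every activation from the forward pass stays alive
leaky = [model(x) for _ in range(3)]

# GOOD: detach() breaks the graph; .cpu() moves the data off the GPU
safe = [model(x).detach().cpu() for _ in range(3)]

print(leaky[0].grad_fn is not None)  # True: graph retained
print(safe[0].grad_fn is None)       # True: nothing retained on the GPU
```

With a real model, each entry in `leaky` pins every intermediate activation from its forward pass, so VRAM grows linearly with the list length.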

Detecting the Leak

Use this monitoring snippet to track VRAM allocation over time:

import torch
import gc

def report_gpu_memory(tag=""):
    # allocated = live tensors; reserved = allocated plus PyTorch's cached blocks
    allocated = torch.cuda.memory_allocated() / 1e6
    reserved = torch.cuda.memory_reserved() / 1e6
    print(f"[{tag}] Allocated: {allocated:.1f} MB | Reserved: {reserved:.1f} MB")

# Call before and after each request/batch
report_gpu_memory("before_request")
# ... process request ...
report_gpu_memory("after_request")

If “Allocated” grows after each call and never drops, you have a leak. For production monitoring, integrate this with your GPU monitoring setup to catch leaks before they cause downtime.

Finding the specific leaking tensors

import torch
import gc

gc.collect()
torch.cuda.empty_cache()

# List all tensors currently on GPU
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) and obj.is_cuda:
            print(f"{type(obj).__name__}: {obj.shape} {obj.dtype} {obj.device}")
    except:
        pass

This prints every GPU tensor that Python’s garbage collector knows about. Run it after a few inference calls — the tensors that keep growing in count are your leak.
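To make that diff concrete, you can count tensors by shape and dtype and compare snapshots between calls. A sketch built on the same gc-based scan:

```python
import gc
from collections import Counter

import torch

def gpu_tensor_census():
    """Count live CUDA tensors, grouped by (shape, dtype)."""
    census = Counter()
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                census[(tuple(obj.shape), str(obj.dtype))] += 1
        except Exception:
            pass
    return census

before = gpu_tensor_census()
# ... run a few inference calls here ...
after = gpu_tensor_census()

# Counter subtraction keeps only groups whose count grew
for (shape, dtype), count in (after - before).items():
    print(f"grew by {count}: {shape} {dtype}")
```

Groups that grow by a fixed amount per request point directly at the code path holding the references.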

Fixing the Most Common Leaks

Fix 1: Wrap inference in torch.no_grad()

@torch.no_grad()
def run_inference(model, inputs):
    return model(inputs)

This single decorator prevents gradient computation, which is the number one cause of memory growth during inference on GPU servers.
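On recent PyTorch versions (1.9+), torch.inference_mode() is a slightly stricter alternative that also skips view tracking and version-counter bookkeeping; for leak prevention it behaves the same way:

```python
import torch

model = torch.nn.Linear(32, 32)
x = torch.randn(4, 32)

# Inside inference_mode, no autograd graph is built at all
with torch.inference_mode():
    out = model(x)

print(out.requires_grad)  # False: nothing for autograd to retain
```

Use whichever matches your codebase; the key point is that no forward pass in the serving path should build a graph.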

Fix 2: Detach before collecting results

# BAD: keeps computation graph
results.append(model(x))

# GOOD: detach from graph and move to CPU
results.append(model(x).detach().cpu())

Fix 3: Delete references and clear cache

output = model(input_tensor)
result = output.detach().cpu().numpy()
del output, input_tensor
torch.cuda.empty_cache()

Fix 4: Scope temporary tensors and free them explicitly

with torch.cuda.device(0):  # note: this context manager only selects the device
    temp = torch.randn(1000, 1000, device='cuda')
    result = some_operation(temp)
    del temp                  # drop the last reference to the tensor...
    torch.cuda.empty_cache()  # ...then return its cached block to the driver

Production Leak Prevention

For long-running inference services on vLLM, Ollama, or custom PyTorch servers:

  • Set a VRAM usage threshold alert. If usage exceeds 90 percent for more than five minutes, trigger a restart.
  • Implement periodic gc.collect() and torch.cuda.empty_cache() between requests.
  • For Flask or FastAPI servers, avoid storing any tensor references as global state.
  • Profile memory during load testing before going live — our tutorials cover load testing inference endpoints.
  • Use Docker containers with resource limits as a safety net — the container gets killed and restarted if it exceeds its VRAM budget.
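The periodic-cleanup bullet can be factored into a small helper invoked once per request. A minimal sketch; the interval is an assumption to tune against your traffic, since empty_cache() has a cost:

```python
import gc

import torch

class PeriodicCleanup:
    """Run gc + CUDA cache cleanup every N calls."""

    def __init__(self, every=50):  # assumption: 50 is a starting point, not a rule
        self.every = every
        self.calls = 0

    def tick(self):
        """Call once per request; returns True when cleanup ran."""
        self.calls += 1
        if self.calls % self.every == 0:
            gc.collect()                  # drop unreachable Python objects
            if torch.cuda.is_available():
                torch.cuda.empty_cache()  # return cached blocks to the driver
            return True
        return False

cleanup = PeriodicCleanup(every=2)
print(cleanup.tick())  # False: first call, no cleanup yet
print(cleanup.tick())  # True: second call triggers cleanup
```

In a Flask or FastAPI handler, a single `cleanup.tick()` at the end of the request function is enough; keep the instance module-level, but never let it (or anything else global) hold tensor references.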

Automated VRAM Monitoring Script

#!/bin/bash
# vram_monitor.sh — alert if VRAM usage exceeds threshold
THRESHOLD=90
while true; do
    # query GPU 0 only; on multi-GPU hosts, loop over device indices instead
    USAGE=$(nvidia-smi -i 0 --query-gpu=memory.used,memory.total \
            --format=csv,noheader,nounits | \
            awk -F', ' '{printf "%.0f", ($1/$2)*100}')
    if [ "$USAGE" -gt "$THRESHOLD" ]; then
        echo "WARNING: VRAM at ${USAGE}% — possible leak"
    fi
    sleep 30
done

Integrate this with your alerting system as part of a comprehensive GPU monitoring strategy.

Reliable GPU Servers for Production AI

GigaGPU dedicated servers with generous VRAM give your workloads room to breathe. Pair with proper monitoring for zero-downtime operation.

Browse GPU Servers


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
