The Problem: VRAM Stays Allocated After Inference
You run inference on your GPU server, the model produces output, but nvidia-smi still shows the VRAM fully consumed:
$ nvidia-smi
+---------------------+
| GPU Memory Usage    |
| 20480MiB / 24576MiB |
+---------------------+
Your script finished, the model object should be gone, yet 20 GB of VRAM remains occupied. This blocks other workloads from using the GPU and defeats the purpose of having a multi-purpose dedicated server.
Why PyTorch Does Not Release GPU Memory
PyTorch uses a caching memory allocator. When you delete a tensor, PyTorch does not return the memory to CUDA immediately. Instead, it keeps the memory in its internal pool for reuse. This is a deliberate performance optimization — allocating GPU memory through the CUDA driver is slow, and reusing cached blocks is fast.
The consequence is that del model does not reduce the number shown in nvidia-smi. The memory is still held by the PyTorch process even though no Python objects reference it.
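The gap between what Python sees and what the allocator holds is easy to observe. A minimal sketch (the ~1 GiB tensor size and the availability guard are illustrative choices, not part of any API):

```python
import torch

def caching_demo():
    """Show that deleting a tensor lowers 'allocated' but not 'reserved'."""
    if not torch.cuda.is_available():
        print("CUDA not available; skipping demo")
        return None
    x = torch.empty(256, 1024, 1024, device="cuda")  # ~1 GiB of fp32
    del x  # the Python reference is gone, but PyTorch keeps the block cached
    allocated = torch.cuda.memory_allocated()  # drops back toward zero
    reserved = torch.cuda.memory_reserved()    # still holds the cached ~1 GiB
    print(f"allocated={allocated}, reserved={reserved}")
    return allocated, reserved

caching_demo()
```

On a GPU machine, `reserved` stays near 1 GiB after the `del`, which is exactly the number nvidia-smi keeps reporting.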
Three Levels of Memory Release
Level 1: Clear the PyTorch cache
import torch
import gc
# Delete the model and all tensors
del model
del outputs
# Force garbage collection
gc.collect()
# Release cached memory back to CUDA
torch.cuda.empty_cache()
# Check result
print(f"Allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved()/1e9:.2f} GB")
After empty_cache(), "Allocated" should drop to near zero. "Reserved" may stay above zero if fragmented cache blocks are still partly in use. nvidia-smi reports the process's full footprint, roughly the "Reserved" value plus a few hundred megabytes of CUDA context overhead, not the "Allocated" value.
Level 2: Deeper cleanup
If empty_cache is not enough, typically because hidden references (autograd graphs, dataloader workers, C++ extensions) still hold GPU memory, a more thorough sequence helps. Note that PyTorch exposes no public API to destroy the CUDA context itself, and its reset_* functions only clear bookkeeping counters, not memory:
torch.cuda.synchronize()   # wait for pending kernels to finish
gc.collect()               # break reference cycles that keep tensors alive
torch.cuda.empty_cache()   # return unused cached blocks to the driver
torch.cuda.ipc_collect()   # free memory shared via CUDA IPC
# These only reset statistics, they do not release memory:
# torch.cuda.reset_peak_memory_stats()
# torch.cuda.reset_accumulated_memory_stats()
Level 3: Kill the process
The only guaranteed way to release all GPU memory is to terminate the Python process. CUDA memory is tied to the process — when the process exits, all its GPU allocations are freed immediately.
# Find the PID using the GPU
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv
# Ask it to exit cleanly first; force-kill only if it ignores SIGTERM
kill <PID>
kill -9 <PID>
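Because only process exit guarantees a full release, a common design is to run each inference job in a short-lived child process. A minimal sketch using the standard library (in a real deployment the snippet would load the model and run inference, returning results over stdout or a file; the trivial arithmetic here is a stand-in):

```python
import subprocess
import sys

def run_isolated(code: str) -> str:
    """Run a Python snippet in a fresh interpreter process.

    When the child exits, the CUDA driver reclaims every byte of GPU
    memory it allocated, with no cooperation from PyTorch required.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Illustrative stand-in for a real inference snippet:
print(run_isolated("print(2 + 2)"))
```

The trade-off is startup cost: the child pays for interpreter launch, CUDA context creation, and model loading on every call, so this fits batch jobs better than latency-sensitive serving.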
Production Patterns for Memory Management
For PyTorch inference servers that load and unload models dynamically:
import gc
import torch
from transformers import AutoModelForCausalLM

class ModelManager:
    def __init__(self):
        self.model = None

    def load_model(self, model_path):
        self.unload_model()  # free the previous model before loading
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path, device_map="auto"
        )

    def unload_model(self):
        if self.model is not None:
            self.model.cpu()  # move parameters off the GPU first
            del self.model
            self.model = None
            gc.collect()
            torch.cuda.empty_cache()
Moving the model to CPU before deletion ensures CUDA memory is freed from all parameter tensors. This pattern works well for servers that swap between models — common on shared GPU servers.
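The swap pattern can be exercised without downloading any weights by substituting a small stand-in model (TinyModelManager and the torch.nn.Linear factory below are illustrative assumptions, not part of any library):

```python
import gc
import torch

class TinyModelManager:
    """Same load/unload shape as above, with a factory instead of a path."""
    def __init__(self):
        self.model = None

    def load_model(self, factory):
        self.unload_model()  # free the previous model before loading
        self.model = factory()

    def unload_model(self):
        if self.model is not None:
            self.model.cpu()  # no-op on CPU, frees VRAM on GPU
            self.model = None
            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()

mgr = TinyModelManager()
mgr.load_model(lambda: torch.nn.Linear(4, 4))
mgr.load_model(lambda: torch.nn.Linear(8, 8))  # old model freed first
print(mgr.model.out_features)
```

Loading a second model through the manager releases the first one before the new allocation, which is what keeps peak VRAM at one model's worth instead of two.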
Framework-Specific Notes
- vLLM: manages its own memory pool via paged attention. Memory is released only when the vLLM process exits. Reconfiguring KV cache requires a restart.
- TensorFlow: call tf.config.experimental.set_memory_growth(gpu, True) to prevent grabbing all VRAM at startup. Without this, TF allocates the entire GPU by default.
- Ollama: handles model loading and unloading automatically. Idle models are evicted from VRAM after a timeout.
- Stable Diffusion: pipeline objects hold multiple sub-models (VAE, UNet, text encoder). Delete the entire pipeline, not just individual components.
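The TensorFlow setting has to run before any GPU is initialized. A hedged sketch, guarded so it degrades gracefully when TensorFlow is not installed:

```python
try:
    import tensorflow as tf
except ImportError:  # TensorFlow not installed
    tf = None

def enable_memory_growth() -> int:
    """Enable on-demand VRAM growth; returns the number of GPUs configured."""
    if tf is None:
        return 0
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before the GPU is initialized, i.e. before any op executes
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)

print(enable_memory_growth())
```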
Monitoring and Prevention
# Add to your inference loop
import torch

def log_gpu_memory(tag=""):
    alloc = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    max_alloc = torch.cuda.max_memory_allocated() / 1e9
    print(f"[{tag}] Alloc: {alloc:.2f}GB | Reserved: {reserved:.2f}GB | Peak: {max_alloc:.2f}GB")
Integrate this with your GPU monitoring setup. For containerised deployments, set GPU memory limits per container to prevent one workload from starving others.
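One convenient way to wire such logging into a loop is a context manager that reports the peak allocation around a block of work. A sketch (it falls back to a notice when CUDA is unavailable; the tag names are arbitrary):

```python
import torch
from contextlib import contextmanager

@contextmanager
def track_gpu_memory(tag=""):
    """Report peak VRAM allocated inside the with-block."""
    has_cuda = torch.cuda.is_available()
    if has_cuda:
        torch.cuda.reset_peak_memory_stats()
    try:
        yield
    finally:
        if has_cuda:
            peak = torch.cuda.max_memory_allocated() / 1e9
            print(f"[{tag}] peak allocated: {peak:.2f} GB")
        else:
            print(f"[{tag}] CUDA not available")

with track_gpu_memory("inference"):
    pass  # model(inputs) would go here
```

Wrapping each request this way makes per-request peaks visible in the logs, which is usually enough to spot the workload that is responsible for a creeping VRAM footprint.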
Generous VRAM for Your Workloads
GigaGPU dedicated GPU servers offer 24 GB to 80 GB VRAM per card. Choose hardware that fits your memory requirements.
Browse GPU Servers