The Problem: VRAM Stays Allocated After Inference
You run inference on your GPU server, the model produces output, but nvidia-smi still shows the VRAM fully consumed:
$ nvidia-smi
+---------------------+
| GPU Memory Usage    |
| 20480MiB / 24576MiB |
+---------------------+
Your script finished, the model object should be gone, yet 20 GB of VRAM remains occupied. This blocks other workloads from using the GPU and defeats the purpose of having a multi-purpose dedicated server.
Why PyTorch Does Not Release GPU Memory
PyTorch uses a caching memory allocator. When you delete a tensor, PyTorch does not return the memory to CUDA immediately. Instead, it keeps the memory in its internal pool for reuse. This is a deliberate performance optimization — allocating GPU memory through the CUDA driver is slow, and reusing cached blocks is fast.
The consequence is that del model does not reduce the number shown in nvidia-smi. The memory is still held by the PyTorch process even though no Python objects reference it.
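The gap between what Python sees and what the allocator holds is easy to observe. A minimal sketch (the ~1 GiB tensor size and the availability guard are illustrative choices, not part of any API):

```python
import torch

def caching_demo():
    """Show that deleting a tensor lowers 'allocated' but not 'reserved'."""
    if not torch.cuda.is_available():
        print("CUDA not available; skipping demo")
        return None
    x = torch.empty(256, 1024, 1024, device="cuda")  # ~1 GiB of fp32
    del x  # the Python reference is gone, but PyTorch keeps the block cached
    allocated = torch.cuda.memory_allocated()  # drops back toward zero
    reserved = torch.cuda.memory_reserved()    # still holds the cached ~1 GiB
    print(f"allocated={allocated}, reserved={reserved}")
    return allocated, reserved

caching_demo()
```

On a GPU machine, `reserved` stays near 1 GiB after the `del`, which is exactly the number nvidia-smi keeps reporting.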
Three Levels of Memory Release
Level 1: Clear the PyTorch cache
import torch
import gc
# Delete the model and all tensors
del model
del outputs
# Force garbage collection
gc.collect()
# Release cached memory back to CUDA
torch.cuda.empty_cache()
# Check result
print(f"Allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved()/1e9:.2f} GB")
After empty_cache(), "Allocated" should drop to near zero. "Reserved" may stay above zero if fragmented cache blocks are still partly in use. nvidia-smi reports the process's full footprint, roughly the "Reserved" value plus a few hundred megabytes of CUDA context overhead, not the "Allocated" value.
Level 2: Deeper cleanup
If empty_cache is not enough, typically because hidden references (autograd graphs, dataloader workers, C++ extensions) still hold GPU memory, a more thorough sequence helps. Note that PyTorch exposes no public API to destroy the CUDA context itself, and its reset_* functions only clear bookkeeping counters, not memory:
torch.cuda.synchronize()   # wait for pending kernels to finish
gc.collect()               # break reference cycles that keep tensors alive
torch.cuda.empty_cache()   # return unused cached blocks to the driver
torch.cuda.ipc_collect()   # free memory shared via CUDA IPC
# These only reset statistics, they do not release memory:
# torch.cuda.reset_peak_memory_stats()
# torch.cuda.reset_accumulated_memory_stats()
Level 3: Kill the process
The only guaranteed way to release all GPU memory is to terminate the Python process. CUDA memory is tied to the process — when the process exits, all its GPU allocations are freed immediately.
# Find the PID using the GPU
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv
# Ask it to exit cleanly first; force-kill only if it ignores SIGTERM
kill <PID>
kill -9 <PID>
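Because only process exit guarantees a full release, a common design is to run each inference job in a short-lived child process. A minimal sketch using the standard library (in a real deployment the snippet would load the model and run inference, returning results over stdout or a file; the trivial arithmetic here is a stand-in):

```python
import subprocess
import sys

def run_isolated(code: str) -> str:
    """Run a Python snippet in a fresh interpreter process.

    When the child exits, the CUDA driver reclaims every byte of GPU
    memory it allocated, with no cooperation from PyTorch required.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Illustrative stand-in for a real inference snippet:
print(run_isolated("print(2 + 2)"))
```

The trade-off is startup cost: the child pays for interpreter launch, CUDA context creation, and model loading on every call, so this fits batch jobs better than latency-sensitive serving.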
Production Patterns for Memory Management
For PyTorch inference servers that load and unload models dynamically:
import gc
import torch
from transformers import AutoModelForCausalLM

class ModelManager:
    def __init__(self):
        self.model = None

    def load_model(self, model_path):
        self.unload_model()  # free the previous model before loading
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path, device_map="auto"
        )

    def unload_model(self):
        if self.model is not None:
            self.model.cpu()  # move parameters off the GPU first
            del self.model
            self.model = None
            gc.collect()
            torch.cuda.empty_cache()
Moving the model to CPU before deletion ensures CUDA memory is freed from all parameter tensors. This pattern works well for servers that swap between models — common on shared GPU servers.
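The swap pattern can be exercised without downloading any weights by substituting a small stand-in model (TinyModelManager and the torch.nn.Linear factory below are illustrative assumptions, not part of any library):

```python
import gc
import torch

class TinyModelManager:
    """Same load/unload shape as above, with a factory instead of a path."""
    def __init__(self):
        self.model = None

    def load_model(self, factory):
        self.unload_model()  # free the previous model before loading
        self.model = factory()

    def unload_model(self):
        if self.model is not None:
            self.model.cpu()  # no-op on CPU, frees VRAM on GPU
            self.model = None
            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()

mgr = TinyModelManager()
mgr.load_model(lambda: torch.nn.Linear(4, 4))
mgr.load_model(lambda: torch.nn.Linear(8, 8))  # old model freed first
print(mgr.model.out_features)
```

Loading a second model through the manager releases the first one before the new allocation, which is what keeps peak VRAM at one model's worth instead of two.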
Framework-Specific Notes
- vLLM: manages its own memory pool via paged attention. Memory is released only when the vLLM process exits. Reconfiguring KV cache requires a restart.
- TensorFlow: call tf.config.experimental.set_memory_growth(gpu, True) to prevent grabbing all VRAM at startup. Without this, TF allocates the entire GPU by default.
- Ollama: handles model loading and unloading automatically. Idle models are evicted from VRAM after a timeout.
- Stable Diffusion: pipeline objects hold multiple sub-models (VAE, UNet, text encoder). Delete the entire pipeline, not just individual components.
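The TensorFlow setting has to run before any GPU is initialized. A hedged sketch, guarded so it degrades gracefully when TensorFlow is not installed:

```python
try:
    import tensorflow as tf
except ImportError:  # TensorFlow not installed
    tf = None

def enable_memory_growth() -> int:
    """Enable on-demand VRAM growth; returns the number of GPUs configured."""
    if tf is None:
        return 0
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before the GPU is initialized, i.e. before any op executes
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)

print(enable_memory_growth())
```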
Monitoring and Prevention
# Add to your inference loop
import torch

def log_gpu_memory(tag=""):
    alloc = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    max_alloc = torch.cuda.max_memory_allocated() / 1e9
    print(f"[{tag}] Alloc: {alloc:.2f}GB | Reserved: {reserved:.2f}GB | Peak: {max_alloc:.2f}GB")
Integrate this with your GPU monitoring setup. For containerised deployments, set GPU memory limits per container to prevent one workload from starving others.
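One convenient way to wire such logging into a loop is a context manager that reports the peak allocation around a block of work. A sketch (it falls back to a notice when CUDA is unavailable; the tag names are arbitrary):

```python
import torch
from contextlib import contextmanager

@contextmanager
def track_gpu_memory(tag=""):
    """Report peak VRAM allocated inside the with-block."""
    has_cuda = torch.cuda.is_available()
    if has_cuda:
        torch.cuda.reset_peak_memory_stats()
    try:
        yield
    finally:
        if has_cuda:
            peak = torch.cuda.max_memory_allocated() / 1e9
            print(f"[{tag}] peak allocated: {peak:.2f} GB")
        else:
            print(f"[{tag}] CUDA not available")

with track_gpu_memory("inference"):
    pass  # model(inputs) would go here
```

Wrapping each request this way makes per-request peaks visible in the logs, which is usually enough to spot the workload that is responsible for a creeping VRAM footprint.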
Generous VRAM for Your Workloads
GigaGPU dedicated GPU servers offer 24 GB to 80 GB VRAM per card. Choose hardware that fits your memory requirements.
Browse GPU Servers