
CUDA Out of Memory Error: How to Fix OOM on GPU Servers

Fix the RuntimeError CUDA out of memory error on GPU servers. Step-by-step guide to diagnose, resolve, and prevent OOM crashes in PyTorch and TensorFlow workloads.

The OOM Error Message You Are Seeing

You have hit something like this in your terminal:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
(GPU 0; 23.65 GiB total capacity; 21.83 GiB already allocated;
512.00 MiB free; 22.14 GiB reserved in total by PyTorch)

This error means your GPU’s VRAM has been exhausted. The process tried to request an additional block of memory that simply does not fit. It is the single most common complaint on PyTorch GPU servers and TensorFlow GPU servers alike, and it almost always has a concrete, fixable cause.

Why CUDA Out of Memory Happens

VRAM is finite. On a dedicated GPU server, you have a fixed pool of high-bandwidth memory attached to each GPU. Several factors push consumption beyond that limit:

  • Model size exceeds available VRAM. Loading a 13-billion-parameter model in FP16 requires roughly 26 GB. An RTX 4090 has 24 GB.
  • Batch size is too large. Activations scale linearly with batch size. Doubling the batch roughly doubles activation memory.
  • Memory fragmentation. PyTorch’s caching allocator may hold blocks it cannot reuse, causing allocation failures even when aggregate free memory looks sufficient.
  • Accumulated gradients or tensors. Forgotten references prevent garbage collection.

Understanding which factor applies to your workload determines the fix. Check your current usage with nvidia-smi before making changes — our guide on monitoring GPU usage covers this in depth.
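You can also query free and total VRAM from inside Python before changing anything. A minimal sketch using PyTorch's `torch.cuda.mem_get_info` (the `vram_report` helper name is ours):

```python
import torch

def vram_report(device: int = 0):
    """Return (free_gb, total_gb) for one GPU, or None when no CUDA device is visible."""
    if not torch.cuda.is_available():
        return None
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return free_bytes / 1e9, total_bytes / 1e9

report = vram_report()
print("No CUDA device visible" if report is None
      else f"Free: {report[0]:.2f} GB of {report[1]:.2f} GB")
```

If the free figure is far below what nvidia-smi shows as unused, another process on the box is holding the difference.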

Step-by-Step Fix for CUDA OOM

1. Check real-time VRAM consumption

watch -n 1 nvidia-smi

Look at the “Memory-Usage” column. If another process is eating VRAM, kill it or move it to a different GPU.

2. Reduce batch size

This is the fastest fix. Cut your batch size in half and re-run. If the job succeeds, gradually increase until you find the sweet spot. For inference-only workloads on vLLM, the --max-num-batched-tokens flag controls this directly.
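The halving loop can be automated. A sketch, assuming a `step_fn` that stands in for your own forward (or forward-plus-backward) pass; `find_max_batch_size` is our name, not a PyTorch API:

```python
import torch

def find_max_batch_size(step_fn, start: int = 64, min_size: int = 1) -> int:
    """Halve the batch size until step_fn(batch_size) completes without CUDA OOM."""
    size = start
    while size >= min_size:
        try:
            step_fn(size)
            return size
        except RuntimeError as e:
            # torch.cuda.OutOfMemoryError subclasses RuntimeError; re-raise anything else
            if "out of memory" not in str(e).lower():
                raise
            torch.cuda.empty_cache()  # release cached blocks before the next attempt
            size //= 2
    raise RuntimeError("OOM even at the minimum batch size")
```

In practice you would then train slightly below the value this returns, since peak memory can vary between steps.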

3. Enable mixed precision

from torch.cuda.amp import autocast
with autocast():
    output = model(input_batch)

Mixed precision halves memory consumption for activations while preserving model quality. See our tutorials section for precision format comparisons.
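For training you also need loss scaling, or FP16 gradients can underflow to zero. A fuller sketch of one mixed-precision step; the model and batch here are dummies, and the code falls back to CPU when no GPU is present:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 10).to(device)        # stand-in for your model
optimizer = torch.optim.AdamW(model.parameters())
# The scaler is only needed (and only active) for FP16 on CUDA
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

input_batch = torch.randn(32, 512, device=device)  # dummy batch
targets = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type=device):           # forward runs in reduced precision
    output = model(input_batch)
    loss = torch.nn.functional.cross_entropy(output, targets)
scaler.scale(loss).backward()                      # backward on the scaled loss
scaler.step(optimizer)
scaler.update()
```

`set_to_none=True` also saves a little memory by freeing gradient tensors between steps instead of zeroing them in place.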

4. Use gradient checkpointing (training only)

model.gradient_checkpointing_enable()

This Hugging Face Transformers helper trades compute for memory by recomputing intermediate activations during the backward pass instead of storing them.
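Outside Hugging Face models, plain PyTorch offers the same trade-off through torch.utils.checkpoint. A minimal sketch with a toy stack of layers:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack whose intermediate activations would normally all be kept alive
model = torch.nn.Sequential(*[torch.nn.Linear(256, 256) for _ in range(8)])
x = torch.randn(32, 256, requires_grad=True)

# Split into 4 segments: only segment-boundary activations are stored;
# everything in between is recomputed during the backward pass
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()
```

More segments means lower peak memory but more recomputation; 2 to 4 segments is a common starting point.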

5. Clear the PyTorch memory cache

import torch
torch.cuda.empty_cache()

This releases all unused cached memory back to the GPU. It does not free tensors that are still referenced.
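Because of that caveat, the full cleanup sequence is: drop the references, run the garbage collector, then flush the cache. A sketch (the tensor is a stand-in; on a machine without a GPU the cache calls are no-ops):

```python
import gc
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
leftover = torch.empty(1024, 1024, device=device)  # stand-in for a tensor you no longer need

# empty_cache() cannot release memory that live Python references still pin,
# so delete the references first, collect, then flush the allocator's cache.
del leftover
gc.collect()
torch.cuda.empty_cache()

print(f"Allocated now: {torch.cuda.memory_allocated()/1e9:.3f} GB")
```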

6. Move to a larger GPU

Sometimes the model simply needs more VRAM. GigaGPU offers servers with 48 GB (RTX 6000 Pro) or 80 GB of VRAM per GPU, plus multi-GPU configurations. Browse available GPU servers to find a match for your workload.

Verifying the Fix Worked

After applying one or more of the steps above, confirm success:

python -c "
import torch
print(f'Allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB')
print(f'Reserved:  {torch.cuda.memory_reserved()/1e9:.2f} GB')
print(f'Max Alloc: {torch.cuda.max_memory_allocated()/1e9:.2f} GB')
"

If max_memory_allocated stays well below your GPU’s total VRAM, the OOM is resolved. For ongoing monitoring, set up persistent tracking as described in our GPU monitoring guide.

Preventing Future OOM Crashes

Adopt these practices on your dedicated GPU server to avoid recurring out-of-memory errors:

  • Profile peak memory before deploying to production using torch.cuda.max_memory_allocated().
  • Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce fragmentation on PyTorch 2.0+.
  • For inference pipelines, quantize models to 4-bit or 8-bit with bitsandbytes — this can cut VRAM usage by 75 percent.
  • When running Docker GPU workloads, pin each container to a specific GPU and give it enough shared memory with --gpus device=0 --shm-size=8g (the --gpus flag controls GPU visibility, not a VRAM cap).
  • If you run multiple models, consider Ollama for automatic memory management across models.
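The first bullet can be wrapped in a small helper you run in CI or a staging job before promoting a workload. A sketch (`profile_peak` is our name; it returns 0.0 when no CUDA device is present):

```python
import torch

def profile_peak(fn, device: int = 0) -> float:
    """Run fn once and return the peak VRAM it allocated, in GB."""
    if not torch.cuda.is_available():
        fn()
        return 0.0
    torch.cuda.reset_peak_memory_stats(device)
    fn()
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device) / 1e9

# Example with a dummy CPU workload (substitute your real model step,
# on the GPU, to get a meaningful reading)
peak_gb = profile_peak(lambda: torch.randn(256, 256).sum())
print(f"Peak VRAM: {peak_gb:.2f} GB")
```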

When the Real Fix Is More VRAM

Some workloads genuinely need more GPU memory than a single consumer card provides. If you are running large language models, multi-image generation batches, or training runs that exceed 24 GB even after optimization, it is time to look at professional-grade hardware. Our vLLM memory optimization guide covers software-side tuning, but hardware limits are hardware limits.

Stop Fighting OOM Errors

GigaGPU dedicated servers come with 24 GB to 80 GB of VRAM per GPU. Pick the right hardware and leave memory headaches behind.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
