The OOM Error Message You Are Seeing
You have hit something like this in your terminal:
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
(GPU 0; 23.65 GiB total capacity; 21.83 GiB already allocated;
512.00 MiB free; 22.14 GiB reserved in total by PyTorch)
This error means your GPU’s VRAM has been exhausted: the process requested an additional block of memory that simply does not fit in what remains. It is one of the most common errors on GPU servers running PyTorch and TensorFlow alike, and it almost always has a concrete, fixable cause.
Why CUDA Out of Memory Happens
VRAM is finite. On a dedicated GPU server, you have a fixed pool of high-bandwidth memory attached to each GPU. Several factors push consumption beyond that limit:
- Model size exceeds available VRAM. Loading a 13-billion-parameter model in FP16 requires roughly 26 GB for the weights alone. A 24 GB card such as the RTX 4090 cannot fit it.
- Batch size is too large. Activations scale linearly with batch size. Doubling the batch roughly doubles activation memory.
- Memory fragmentation. PyTorch’s caching allocator may hold blocks it cannot reuse, causing allocation failures even when aggregate free memory looks sufficient.
- Accumulated gradients or tensors. Forgotten references prevent garbage collection.
Understanding which factor applies to your workload determines the fix. Check your current usage with nvidia-smi before making changes — our guide on monitoring GPU usage covers this in depth.
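Before reaching for nvidia-smi, you can do the arithmetic up front. The sketch below is a back-of-the-envelope estimator, not a PyTorch API: parameter count times bytes per parameter, plus gradients and Adam optimizer state when training. Activations and KV caches come on top of this, so treat it as a lower bound.

```python
def estimate_vram_gb(n_params: float, bytes_per_param: int = 2,
                     training: bool = False) -> float:
    """Rough lower bound on VRAM in GB for a model's weights.

    training=True adds gradients (same size as the weights) and
    Adam optimizer state (two FP32 moments per parameter).
    """
    total = n_params * bytes_per_param            # weights
    if training:
        total += n_params * bytes_per_param       # gradients
        total += n_params * 4 * 2                 # Adam m and v in FP32
    return total / 1e9

# A 13B-parameter model in FP16: ~26 GB for the weights alone
print(f"{estimate_vram_gb(13e9, 2):.0f} GB")
```

If the inference estimate alone exceeds your card's VRAM, no batch-size tuning will save you; skip straight to quantization or a larger GPU.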
Step-by-Step Fix for CUDA OOM
1. Check real-time VRAM consumption
watch -n 1 nvidia-smi
Look at the “Memory-Usage” column. If another process is eating VRAM, kill it or move it to a different GPU.
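If you cannot kill the competing process, move your own job to an idle GPU instead. Setting CUDA_VISIBLE_DEVICES before launch restricts which devices CUDA exposes to the process (train.py here is a placeholder for your own script):

```shell
# Expose only GPU 1 to this process; GPU 0's VRAM stays untouched
CUDA_VISIBLE_DEVICES=1 python train.py
```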
2. Reduce batch size
This is the fastest fix. Cut your batch size in half and re-run. If the job succeeds, gradually increase until you find the sweet spot. For inference-only workloads on vLLM, the --max-num-batched-tokens flag controls this directly.
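The halve-and-retry loop can be automated. A minimal pure-Python sketch, where run_step stands in for your own forward/backward function; matching on the "out of memory" substring of the RuntimeError message is a common heuristic for detecting CUDA OOM:

```python
def run_with_backoff(run_step, batch_size: int, min_batch: int = 1):
    """Halve the batch size on CUDA OOM until the step fits.

    run_step is any callable taking a batch size; it is expected to
    raise a RuntimeError containing "out of memory" when the batch
    does not fit in VRAM.
    """
    while batch_size >= min_batch:
        try:
            return batch_size, run_step(batch_size)
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise                 # unrelated error: propagate it
            batch_size //= 2          # halve and retry
    raise RuntimeError("out of memory even at the minimum batch size")
```

In a real training loop you would also call torch.cuda.empty_cache() inside the except branch so the failed attempt's cached blocks are released before the retry.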
3. Enable mixed precision
from torch.cuda.amp import autocast

with autocast():
    output = model(input_batch)
Mixed precision halves memory consumption for activations while preserving model quality. See our tutorials section for precision format comparisons.
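For training, autocast is paired with a GradScaler so FP16 gradients do not underflow. A minimal sketch of a full mixed-precision training step; the tiny Linear model and random batch are stand-ins for your own, and the enabled flag lets the same code fall back to full precision on CPU:

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 10).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
# GradScaler rescales the loss so FP16 gradients keep precision;
# it becomes a no-op when disabled (e.g. on CPU)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

def train_step(x, y):
    opt.zero_grad(set_to_none=True)
    # forward pass runs in reduced precision where it is safe to do so
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(opt)                # unscales gradients, then steps
    scaler.update()
    return loss.item()

x = torch.randn(32, 128, device=device)
y = torch.randint(0, 10, (32,), device=device)
print(f"loss: {train_step(x, y):.3f}")
```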
4. Use gradient checkpointing (training only)
For Hugging Face Transformers models:
model.gradient_checkpointing_enable()
This trades compute for memory by recomputing intermediate activations during the backward pass instead of storing them.
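gradient_checkpointing_enable() is the Hugging Face Transformers API. For a plain PyTorch nn.Sequential, torch.utils.checkpoint offers the same trade. A minimal sketch with a toy 8-layer stack standing in for a real model:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack whose intermediate activations would otherwise all be stored
model = torch.nn.Sequential(*[torch.nn.Linear(256, 256) for _ in range(8)])
x = torch.randn(4, 256, requires_grad=True)

# Split the stack into 2 segments: only segment-boundary activations are
# kept; the rest are recomputed during the backward pass
out = checkpoint_sequential(model, 2, x, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)
```

More segments means lower peak memory but more recomputation; two to four segments is a common starting point.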
5. Clear the PyTorch memory cache
import torch
torch.cuda.empty_cache()
This releases unused cached memory back to the driver, so it shows up as free in nvidia-smi. It does not free tensors that are still referenced by your code.
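Because empty_cache() cannot touch referenced tensors, the effective pattern is: drop the Python references first, collect garbage, then flush the cache. A small sketch (the is_available guard makes it safe on CPU-only machines):

```python
import gc
import torch

def free_tensor_memory():
    """empty_cache only helps once Python references are gone."""
    gc.collect()                  # reclaim unreachable tensors first
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached blocks to the driver

# Typical usage: delete large intermediates explicitly, then flush
big = torch.zeros(1024, 1024)
del big                           # without this, the cache keeps the block
free_tensor_memory()
```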
6. Move to a larger GPU
Sometimes the model simply needs more VRAM. GigaGPU offers servers with 48 GB and 80 GB of VRAM per GPU, as well as multi-GPU configurations. Browse available GPU servers to find a match for your workload.
Verifying the Fix Worked
After applying one or more of the steps above, confirm success:
python -c "
import torch
print(f'Allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB')
print(f'Reserved: {torch.cuda.memory_reserved()/1e9:.2f} GB')
print(f'Max Alloc: {torch.cuda.max_memory_allocated()/1e9:.2f} GB')
"
If max_memory_allocated stays well below your GPU’s total VRAM, the OOM is resolved. For ongoing monitoring, set up persistent tracking as described in our GPU monitoring guide.
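To make this check repeatable, you can wrap any workload in a small context manager that resets and reads the peak-memory counter. This is a sketch using standard torch.cuda calls; it degrades to a no-op on machines without a GPU:

```python
import contextlib
import torch

@contextlib.contextmanager
def track_peak_vram(label: str):
    """Print peak VRAM used inside the block (no-op without a GPU)."""
    if not torch.cuda.is_available():
        yield
        return
    torch.cuda.reset_peak_memory_stats()   # start the peak counter fresh
    yield
    peak = torch.cuda.max_memory_allocated() / 1e9
    print(f"{label}: peak {peak:.2f} GB")

with track_peak_vram("forward pass"):
    pass  # run your model here
```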
Preventing Future OOM Crashes
Adopt these practices on your dedicated GPU server to avoid recurring out-of-memory errors:
- Profile peak memory before deploying to production using torch.cuda.max_memory_allocated().
- Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce fragmentation on PyTorch 2.0+.
- For inference pipelines, quantize models to 4-bit or 8-bit with bitsandbytes; this can cut VRAM usage by up to 75 percent.
- When running Docker GPU workloads, pin each container to a specific GPU with --gpus device=0 and raise shared memory with --shm-size=8g.
- If you run multiple models, consider Ollama for automatic memory management across models.
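The quantization savings above are easy to verify with back-of-the-envelope math. This helper is an illustration, not a library function; it counts weight memory only, ignoring activations and KV cache:

```python
def quantized_weight_gb(n_params: float, bits: int) -> float:
    """Approximate weight memory in GB for a quantized model."""
    return n_params * bits / 8 / 1e9

# 13B parameters at different precisions
for bits in (16, 8, 4):
    print(f"{bits}-bit: {quantized_weight_gb(13e9, bits):.1f} GB")
```

Going from 16-bit (26 GB) to 4-bit (6.5 GB) is exactly the 75 percent cut mentioned above, and it brings a 13B model comfortably inside a 24 GB card.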
When the Real Fix Is More VRAM
Some workloads genuinely need more GPU memory than a single consumer card provides. If you are running large language models, multi-image generation batches, or training runs that exceed 24 GB even after optimization, it is time to look at professional-grade hardware. Our vLLM memory optimization guide covers software-side tuning, but hardware limits are hardware limits.
Stop Fighting OOM Errors
GigaGPU dedicated servers come with 24 GB to 80 GB of VRAM per GPU. Pick the right hardware and leave memory headaches behind.
Browse GPU Servers