Symptom: Generation Crashes With CUDA OOM
You queue an image generation on your GPU server and Stable Diffusion dies mid-process with a wall of red text:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total capacity of 23.65 GiB of which 1.12 GiB is free.
This happens most often when generating at high resolutions, using SDXL or Flux models, or running with batch sizes greater than one. The model, VAE, and intermediate tensors compete for the same VRAM pool, and one of them loses.
Immediate VRAM Recovery
# Check what's consuming VRAM
nvidia-smi
# Kill orphaned GPU processes
sudo fuser -v /dev/nvidia*
sudo kill -9 <PID>  # substitute a PID reported by fuser above
# In Python, force garbage collection
import torch, gc
gc.collect()
torch.cuda.empty_cache()
Stale processes from previous crashed generations are the most common hidden VRAM thief. Always check before debugging further.
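If you do this often, it helps to script the check. A minimal sketch: `gpu_compute_apps` (a hypothetical helper, not part of any library) shells out to `nvidia-smi` with its CSV query flags and returns the processes currently holding VRAM, so you can spot orphans programmatically before killing them:

```python
import subprocess

def parse_compute_apps(csv_text):
    """Parse `nvidia-smi --query-compute-apps=pid,used_memory
    --format=csv,noheader,nounits` output into (pid, MiB) pairs."""
    apps = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        pid, mem = (field.strip() for field in line.split(","))
        apps.append((int(pid), int(mem)))
    return apps

def gpu_compute_apps():
    """Run nvidia-smi and return processes currently holding VRAM."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_compute_apps(out)
```

Anything in this list that should not be running is a candidate for the `kill -9` above.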
Fix 1: Reduce Resolution and Batch Size
VRAM usage scales with pixel count, i.e. quadratically with the image's linear dimensions (and the attention layers scale worse still). Halving both dimensions therefore cuts activation memory to roughly one quarter:
# SDXL native resolution (high VRAM)
image = pipe("a landscape", height=1024, width=1024).images[0]
# Reduced resolution (much lower VRAM)
image = pipe("a landscape", height=768, width=768).images[0]
# Single image instead of batch
image = pipe("a landscape", num_images_per_prompt=1).images[0]
Generate at a lower resolution and upscale afterwards with a separate upscaling model such as Real-ESRGAN; the result is often comparable to native high-resolution generation.
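As a rough back-of-envelope for the scaling claim above: SD-family models work on a latent tensor with 4 channels at 1/8 the pixel resolution, and every intermediate activation scales with the same pixel count. The helper below (`latent_bytes` is an illustrative name, not a library function) shows the ratio:

```python
def latent_bytes(height, width, batch=1, channels=4, bytes_per_elem=2):
    """Rough size of the SD-family latent tensor (fp16 by default):
    4 latent channels at 1/8 the pixel resolution."""
    return batch * channels * (height // 8) * (width // 8) * bytes_per_elem

# The latent itself is small; the point is that activation memory
# tracks the same pixel count, so this ratio approximates the savings.
ratio = latent_bytes(768, 768) / latent_bytes(1024, 1024)
print(f"768x768 needs roughly {ratio:.0%} of the memory of 1024x1024")
```

So dropping from 1024×1024 to 768×768 already frees over 40% of activation memory before any other optimisation.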
Fix 2: Enable Memory-Efficient Attention
Attention computation is the largest single VRAM consumer during generation:
# Option A: xformers (fastest, best memory efficiency)
pip install xformers
pipe.enable_xformers_memory_efficient_attention()
# Option B: attention slicing (built into diffusers, no extra install)
from diffusers import StableDiffusionXLPipeline
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
use_safetensors=True
).to("cuda")
pipe.enable_attention_slicing(slice_size="auto")
xformers typically saves 30-40% VRAM compared to naive attention. Note that on PyTorch 2.0+, diffusers already uses PyTorch's memory-efficient scaled-dot-product attention by default, so the additional gain from xformers is smaller there. Install it following our PyTorch GPU guide for the correct CUDA version.
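In practice you often want to try the faster option and fall back gracefully. A minimal sketch (the wrapper function is hypothetical; the two pipeline methods are real diffusers APIs):

```python
def enable_memory_efficient_attention(pipe):
    """Try xformers first; fall back to attention slicing if xformers
    is missing or incompatible. Returns the strategy that was enabled."""
    try:
        pipe.enable_xformers_memory_efficient_attention()
        return "xformers"
    except Exception:
        # xformers not installed, or built against the wrong CUDA version
        pipe.enable_attention_slicing(slice_size="auto")
        return "attention_slicing"
```

This works with any diffusers pipeline that exposes both methods, so the same startup code runs on machines with and without xformers.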
Fix 3: VAE Tiling and Slicing
The VAE decoder processes the entire latent image at once. For high resolutions, this single operation can exceed available VRAM:
# Enable VAE slicing (processes one slice at a time)
pipe.enable_vae_slicing()
# Enable VAE tiling (tiles the decode for very high res)
pipe.enable_vae_tiling()
# Use an FP16 VAE to halve its memory footprint
# (caveat: SDXL's stock VAE is numerically unstable in FP16 and can
# produce black images; for SDXL, load a fixed FP16 VAE such as
# madebyollin/sdxl-vae-fp16-fix instead of casting)
pipe.vae = pipe.vae.to(dtype=torch.float16)
VAE slicing adds minimal latency but dramatically reduces peak VRAM usage during the decode step.
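Since slicing is nearly free while tiling only matters at high resolutions, you can wrap the decision in a small helper. This is a sketch under an assumed heuristic (the 1024-pixel threshold is our choice, not a diffusers default; `configure_vae` is a hypothetical name, the two methods are real diffusers APIs):

```python
def configure_vae(pipe, height, width, tiling_threshold=1024):
    """Always enable VAE slicing (minimal latency cost); enable tiling
    only when the longest side exceeds `tiling_threshold` pixels.
    The threshold is a rough heuristic, not a library default."""
    pipe.enable_vae_slicing()
    needs_tiling = max(height, width) > tiling_threshold
    if needs_tiling:
        pipe.enable_vae_tiling()
    return needs_tiling
```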
Fix 4: Model CPU Offloading
When nothing else fits, offload model components to CPU RAM between pipeline stages:
# Sequential CPU offloading (slowest but minimum VRAM)
pipe.enable_sequential_cpu_offload()
# Model CPU offloading (faster, moderate VRAM savings)
pipe.enable_model_cpu_offload()
# Comparison on RTX 3090 (24 GB) with SDXL:
# Default: ~12 GB peak VRAM
# enable_model_cpu_offload(): ~8 GB peak VRAM
# enable_sequential_cpu_offload(): ~4 GB peak VRAM
CPU offloading trades generation speed for VRAM savings. Use it as a last resort when other optimisations are insufficient.
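The peak-VRAM figures above suggest a simple decision rule: pick the least invasive strategy that fits. A sketch, using those rough SDXL numbers (~12 / ~8 / ~4 GB) as thresholds — they will differ for other models and hardware, and `choose_offload` is a hypothetical helper:

```python
def choose_offload(pipe, free_vram_gib):
    """Pick the least invasive offload strategy that fits, based on
    the approximate SDXL peak-VRAM figures quoted above."""
    if free_vram_gib >= 12:
        return "none"                      # model fits fully on-GPU
    if free_vram_gib >= 8:
        pipe.enable_model_cpu_offload()    # faster, moderate savings
        return "model_offload"
    pipe.enable_sequential_cpu_offload()   # slowest, minimum VRAM
    return "sequential_offload"
```

On a live system you could feed this from `torch.cuda.mem_get_info()`, which returns free and total bytes for the current device.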
Plan Your VRAM Budget
Know what fits on your GPU before hitting OOM. For Stable Diffusion hosting, an RTX 3090 handles SD 1.5 at 1024×1024 or SDXL at 768×768 without special optimisations. An RTX 6000 Pro 96 GB runs Flux at full resolution with room for batching. Check the benchmarks for detailed VRAM profiles. ComfyUI offers node-level memory control for complex workflows. Our CUDA guide ensures your drivers are current, the Docker GPU guide covers containerised setups, and the tutorials section has more PyTorch optimisation techniques.
High-VRAM GPUs for Image Generation
GigaGPU servers with RTX 6000 Pro 96 GB and multi-GPU options — generate at any resolution without memory errors.
Browse GPU Servers