
Stable Diffusion Out of Memory: GPU Fix

Fix CUDA out of memory errors in Stable Diffusion. Covers resolution reduction, VAE slicing, attention optimisation, xformers, model offloading, and VRAM management for SDXL and Flux models.

Symptom: Generation Crashes With CUDA OOM

You queue an image generation on your GPU server and Stable Diffusion dies mid-process with a wall of red text:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total capacity of 23.65 GiB of which 1.12 GiB is free.

This happens most often when generating at high resolutions, using SDXL or Flux models, or running with batch sizes greater than one. The model, VAE, and intermediate tensors compete for the same VRAM pool, and one of them loses.

Immediate VRAM Recovery

# Check what's consuming VRAM
nvidia-smi

# Find orphaned GPU processes, then kill each stale PID
sudo fuser -v /dev/nvidia*
sudo kill -9 <PID>

# In Python, force garbage collection
import torch, gc
gc.collect()
torch.cuda.empty_cache()

Stale processes from previous crashed generations are the most common hidden VRAM thief. Always check before debugging further.
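The orphan check can also be scripted rather than eyeballed. A minimal sketch: `gpu_processes` is a hypothetical helper that parses the CSV output of `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader`, and the sample string below is illustrative, not real output:

```python
def gpu_processes(smi_csv):
    """Parse `nvidia-smi --query-compute-apps=pid,used_memory
    --format=csv,noheader` output into (pid, used_MiB) pairs."""
    procs = []
    for line in smi_csv.strip().splitlines():
        pid, mem = line.split(", ")          # e.g. "41237", "18342 MiB"
        procs.append((int(pid), int(mem.split()[0])))
    return procs

# Illustrative output from a box with one stale 18 GB process:
sample = "41237, 18342 MiB\n52110, 912 MiB"
for pid, mib in gpu_processes(sample):
    print(pid, mib)
```

Anything in the list that doesn't correspond to a generation you are actively running is a candidate for `kill`.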

Fix 1: Reduce Resolution and Batch Size

VRAM usage scales roughly quadratically with image dimensions: halving both height and width cuts memory to roughly one quarter:

# SDXL native resolution (high VRAM)
image = pipe("a landscape", height=1024, width=1024).images[0]

# Reduced resolution (much lower VRAM)
image = pipe("a landscape", height=768, width=768).images[0]

# Single image instead of batch
image = pipe("a landscape", num_images_per_prompt=1).images[0]

Generate at a lower resolution and upscale afterwards with a dedicated upscaler such as Real-ESRGAN for quality comparable to native high-res generation.
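The quarter-memory claim is easy to verify with arithmetic on a single latent tensor. SD/SDXL latents have 4 channels at 1/8 the image resolution, and fp16 uses 2 bytes per element; the helper name is ours:

```python
def latent_bytes(height, width, channels=4, dtype_bytes=2, downscale=8):
    """Bytes for one SD/SDXL-style latent: `channels` channels at
    1/`downscale` the image resolution, fp16 (2 bytes per element)."""
    return (height // downscale) * (width // downscale) * channels * dtype_bytes

print(latent_bytes(1024, 1024))  # 131072 bytes
print(latent_bytes(512, 512))    # 32768 bytes: one quarter
```

The same scaling applies to every intermediate activation in the UNet, which is why dropping from 1024×1024 to 768×768 is often enough on its own to clear an OOM.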

Fix 2: Enable Memory-Efficient Attention

Attention computation is the largest single VRAM consumer during generation:

# Option A: xformers (fastest, best memory efficiency)
pip install xformers
pipe.enable_xformers_memory_efficient_attention()

# Option B: PyTorch scaled-dot-product attention (no extra install;
# diffusers uses it by default on PyTorch 2.x)
from diffusers import StableDiffusionXLPipeline
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True
).to("cuda")

# Option C: attention slicing (works on any setup, slight slowdown)
pipe.enable_attention_slicing(slice_size="auto")

xformers typically saves 30-40% VRAM compared to default attention. Install it following our PyTorch GPU guide for the correct CUDA version.
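The fallback order above can be expressed as a small helper. This function and its string labels are hypothetical, but the preference order follows the article: xformers if installed, PyTorch 2.x built-in SDPA otherwise, attention slicing as the universal fallback:

```python
def pick_attention_backend(has_xformers: bool, torch_version: str) -> str:
    """Choose the most memory-efficient attention path available.

    Preference: xformers > PyTorch 2.x SDPA > attention slicing.
    """
    if has_xformers:
        return "xformers"
    if int(torch_version.split(".")[0]) >= 2:
        return "sdpa"               # built-in scaled-dot-product attention
    return "attention_slicing"      # works everywhere, slight slowdown

print(pick_attention_backend(False, "2.1.0"))  # sdpa
```

Note that slicing and xformers are not additive: enabling xformers already replaces the attention implementation, so you pick one path, not all three.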

Fix 3: VAE Tiling and Slicing

The VAE decoder processes the entire latent image at once. For high resolutions, this single operation can exceed available VRAM:

# Enable VAE slicing (processes one slice at a time)
pipe.enable_vae_slicing()

# Enable VAE tiling (tiles the decode for very high res)
pipe.enable_vae_tiling()

# Use FP16 VAE to halve its memory footprint
pipe.vae = pipe.vae.to(dtype=torch.float16)

VAE slicing adds minimal latency but dramatically reduces peak VRAM usage during the decode step.
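Why slicing helps is simple arithmetic: the decoder's peak activation scales with batch size times output pixels, and slicing decodes one image at a time instead of the whole batch. A sketch, where `bytes_per_pixel` is an illustrative constant standing in for the decoder's widest feature maps, not a measured value:

```python
def vae_decode_peak_bytes(height, width, batch=1, sliced=False,
                          bytes_per_pixel=640):
    """Rough peak-activation estimate for the VAE decode step.

    `bytes_per_pixel` is an illustrative assumption covering the
    decoder's intermediate feature maps, not a measured figure.
    """
    per_image = height * width * bytes_per_pixel
    return per_image if sliced else batch * per_image

batch4 = vae_decode_peak_bytes(1024, 1024, batch=4)
sliced = vae_decode_peak_bytes(1024, 1024, batch=4, sliced=True)
print(batch4 // sliced)  # slicing cuts the peak by the batch size: 4
```

Tiling applies the same idea spatially, bounding the decode cost by the tile size rather than the full image, which is what makes very high resolutions decodable at all.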

Fix 4: Model CPU Offloading

When nothing else fits, offload model components to CPU RAM between pipeline stages:

# Sequential CPU offloading (slowest but minimum VRAM)
pipe.enable_sequential_cpu_offload()

# Model CPU offloading (faster, moderate VRAM savings)
pipe.enable_model_cpu_offload()

# Comparison on RTX 3090 (24 GB) with SDXL:
# Default:                    ~12 GB peak VRAM
# enable_model_cpu_offload(): ~8 GB peak VRAM
# enable_sequential_cpu_offload(): ~4 GB peak VRAM

CPU offloading trades generation speed for VRAM savings. Use it as a last resort when other optimisations are insufficient.
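Using the RTX 3090/SDXL measurements above as thresholds, picking the least intrusive mode that fits can be sketched as follows; the function name and return labels are ours, and the cutoffs come straight from the article's numbers:

```python
def choose_sdxl_mode(free_vram_gb: float) -> str:
    """Pick the least intrusive SDXL setting that fits, using the
    article's RTX 3090 peak-VRAM measurements as thresholds."""
    if free_vram_gb >= 12:
        return "default"                  # no offloading needed
    if free_vram_gb >= 8:
        return "model_cpu_offload"        # moderate savings, faster
    if free_vram_gb >= 4:
        return "sequential_cpu_offload"   # minimum VRAM, slowest
    return "reduce_resolution_first"      # offloading alone won't fit

print(choose_sdxl_mode(10))  # model_cpu_offload
```

One practical caveat: call the offload methods instead of `.to("cuda")`, not after it, since offloading manages device placement itself.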

Plan Your VRAM Budget

Know what fits on your GPU before hitting OOM. For Stable Diffusion hosting, an RTX 3090 handles SD 1.5 at 1024×1024 or SDXL at 768×768 without special optimisations. An RTX 6000 Pro 96 GB runs Flux at full resolution with room for batching. Check the benchmarks for detailed VRAM profiles. ComfyUI offers node-level memory control for complex workflows. Our CUDA guide ensures your drivers are current, the Docker GPU guide covers containerised setups, and the tutorials section has more PyTorch optimisation techniques.

High-VRAM GPUs for Image Generation

GigaGPU servers with RTX 6000 Pro 96 GB and multi-GPU options — generate at any resolution without memory errors.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
