
Stable Diffusion XL VRAM Requirements: From 6 GB Minimum to Production-Ready

How much VRAM does SDXL actually need? Numbers for FP16, FP8, INT8, with and without ControlNets, LoRAs and refiners. Plus the GPU we recommend for each tier.

SDXL is the open-weight image model that scales most predictably with VRAM. Its weight memory halves cleanly from FP16 to FP8, it plays well with offloading, and its rendering pipeline (UNet → VAE → optional refiner) has a known VRAM signature you can plan against. This page is the precise sizing reference.

TL;DR

SDXL Base needs ~8.5 GB at FP16: ~5.2 GB for the UNet, ~1.9 GB for the VAE and the two text encoders, and ~1.5 GB for activations and buffers. With LoRAs and a single ControlNet you’re at ~12 GB. With the refiner ensemble, ~16 GB. So:

  • 6 GB cards — works with offloading, slow
  • 8 GB — comfortable base SDXL
  • 12 GB — base + LoRAs + 1 ControlNet
  • 16 GB — full refiner ensemble
  • 24 GB+ — multi-pipeline, batch generation

SDXL base model VRAM

| Component | FP32 | FP16 | FP8 | INT8 |
|---|---|---|---|---|
| UNet (2.6B params) | 10.4 GB | 5.2 GB | 2.6 GB | 2.6 GB |
| VAE | 0.4 GB | 0.2 GB | 0.1 GB | 0.1 GB |
| Text encoder 1 (CLIP-L) | 0.5 GB | 0.25 GB | 0.13 GB | 0.13 GB |
| Text encoder 2 (OpenCLIP-G) | 2.8 GB | 1.4 GB | 0.7 GB | 0.7 GB |
| Activations + buffers | ~2 GB | ~1.5 GB | ~1.5 GB | ~1.5 GB |
| Total (1024×1024) | ~16 GB | ~8.5 GB | ~5 GB | ~5 GB |
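Most of the table is parameter count × bytes per parameter. A minimal sketch of that arithmetic, assuming decimal gigabytes and taking the activation overhead straight from the table. Only the UNet's 2.6B count is stated above; the VAE and text-encoder counts below are approximations back-derived from the FP16 column, not official figures:

```python
# Bytes per parameter for each storage format (FP8 and INT8 both use 1 byte).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1, "int8": 1}

# Approximate parameter counts. The UNet figure comes from the table; the
# others are assumptions back-derived from the FP16 column.
PARAMS = {
    "unet": 2.6e9,
    "vae": 0.1e9,
    "clip_l": 0.125e9,
    "openclip_g": 0.7e9,
}

# Activation + buffer overhead at 1024x1024 in GB, copied from the table.
ACTIVATIONS_GB = {"fp32": 2.0, "fp16": 1.5, "fp8": 1.5, "int8": 1.5}


def weight_gb(params: float, dtype: str) -> float:
    """Weight memory in decimal GB: parameters x bytes per parameter."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


def sdxl_base_total_gb(dtype: str) -> float:
    """Estimated SDXL Base footprint: every component's weights plus activations."""
    weights = sum(weight_gb(p, dtype) for p in PARAMS.values())
    return weights + ACTIVATIONS_GB[dtype]


if __name__ == "__main__":
    for dtype in ("fp32", "fp16", "fp8"):
        print(f"{dtype}: UNet {weight_gb(PARAMS['unet'], dtype):.1f} GB, "
              f"total ~{sdxl_base_total_gb(dtype):.1f} GB")
```

Run as-is, it reproduces the table's ~16 / ~8.5 / ~5 GB totals to within rounding.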

Almost nobody runs SDXL at FP32 in production; FP16 is the default. FP8 needs GPUs with FP8 tensor cores; we run it on Blackwell (5080/5090/6000 Pro) via TensorRT-LLM-style quantisation, with a quality drop of <1% in our testing.

SDXL variants — Turbo, Lightning, refiner

  • SDXL Base: the original. 25–50 sampling steps. ~8.5 GB FP16.
  • SDXL Refiner: second-pass model that adds fine detail. ~5 GB FP16. Added to a Base pipeline → ~13 GB total.
  • SDXL Turbo: distilled for 1–4 step generation. Same VRAM as Base, just faster.
  • SDXL Lightning: adversarially distilled few-step model (2/4/8-step checkpoints). Same VRAM.
  • Hyper-SDXL: 1-step variant. Same VRAM.
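In Diffusers, the Base-plus-refiner "ensemble of expert denoisers" shares one sampling schedule: `StableDiffusionXLPipeline` stops at `denoising_end` and the refiner, loaded as `StableDiffusionXLImg2ImgPipeline`, resumes from `denoising_start`. A minimal sketch of how a step budget divides between the two, assuming the commonly used 0.8 high-noise fraction as an illustrative value; `split_steps` is a hypothetical helper, not a Diffusers function:

```python
def split_steps(total_steps: int, high_noise_frac: float) -> tuple[int, int]:
    """Split one sampling schedule between SDXL Base and the refiner.

    Base denoises the first `high_noise_frac` of the schedule (high noise);
    the refiner finishes the remaining low-noise steps. Plain rounding is a
    simplification: the real pipelines cut on discrete timesteps.
    """
    if not 0.0 < high_noise_frac <= 1.0:
        raise ValueError("high_noise_frac must be in (0, 1]")
    base_steps = round(total_steps * high_noise_frac)
    return base_steps, total_steps - base_steps


if __name__ == "__main__":
    base, refiner = split_steps(40, 0.8)
    print(f"Base: {base} steps, refiner: {refiner} steps")
```

When both models stay loaded, the split changes latency, not peak VRAM, which is why Base plus refiner lands around 13 GB whatever fraction you pick.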

ControlNets, IP-Adapters, LoRAs

| Add-on | VRAM cost | Notes |
|---|---|---|
| LoRA (single, rank 32–128) | ~50–200 MB | Trivial. Hot-load 8–10 LoRAs on a 12 GB card. |
| LoRA (XL trained, rank 256+) | ~400 MB | Larger; trim if VRAM-tight. |
| ControlNet (one) | ~2.5 GB FP16 / ~1.3 GB FP8 | Each adds its own network pass at every sampling step. |
| IP-Adapter | ~0.5 GB | Cheap. Hot-load several. |
| Inpainting model | ~1 GB additional | On top of Base. |
| Refiner | ~5 GB FP16 | Raises the Base pipeline's peak to ~13 GB when both are loaded. |

Concrete example: SDXL Base + 1 ControlNet + 2 LoRAs + IP-Adapter at FP16 = ~12 GB peak (8.5 + 2.5 + 2 × 0.2 + 0.5). Comfortable on a 3090 24 GB; tight on a 5080 16 GB.
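The concrete example is straight addition over the add-on table. A minimal sketch, assuming the tables' FP16 planning figures (Base ~8.5 GB, each LoRA at the 200 MB top of its range) and that every component stays resident at once; `pipeline_peak_gb` is a hypothetical helper:

```python
# Rough FP16 peak costs in GB, taken from the tables above. These are
# planning estimates, not measurements of any particular checkpoint.
BASE_FP16_GB = 8.5
ADDON_FP16_GB = {
    "lora": 0.2,        # upper end of the rank 32-128 range
    "controlnet": 2.5,
    "ip_adapter": 0.5,
    "refiner": 5.0,
}


def pipeline_peak_gb(**counts: int) -> float:
    """Sum SDXL Base plus add-ons, e.g. pipeline_peak_gb(controlnet=1, lora=2)."""
    total = BASE_FP16_GB
    for addon, n in counts.items():
        if addon not in ADDON_FP16_GB:
            raise KeyError(f"unknown add-on: {addon}")
        total += ADDON_FP16_GB[addon] * n
    return total


if __name__ == "__main__":
    peak = pipeline_peak_gb(controlnet=1, lora=2, ip_adapter=1)
    print(f"Estimated peak: ~{peak:.1f} GB")
```

The same helper covers the refiner row: `pipeline_peak_gb(refiner=1)` gives 13.5 GB, in line with the ~13 GB Base + refiner figure.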

GPU recommendations by tier

| Tier | GPU | What it handles |
|---|---|---|
| Minimum | RTX 3050 6 GB | SDXL FP16 with sequential CPU offload only. ~25 s per 1024×1024 image. |
| Entry | RTX 4060 8 GB | SDXL Base FP16 fits. No room for ControlNets at FP16. |
| Comfortable | RTX 3060 12 GB | Base + LoRAs + 1 ControlNet. The first card we recommend. |
| Sweet spot | RTX 5080 16 GB | Base + LoRAs + 2 ControlNets. FP8 path. Fast. |
| Production | RTX 5090 32 GB | Full ensemble + batch + multiple pipelines hot-loaded. |
| Workstation | RTX 6000 Pro 96 GB | Multi-model serving (SDXL + FLUX + SD3). |

Memory-saving tricks that actually work

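  1. Sequential CPU offload: keeps weights in system RAM and streams each submodule to the GPU only while it runs. Cuts VRAM to ~5 GB. Costs ~30% in latency. Diffusers: pipe.enable_sequential_cpu_offload(). The lighter pipe.enable_model_cpu_offload() swaps whole components (text encoders, UNet, VAE) instead, saving less VRAM with a smaller slowdown.
  2. VAE tiling + slicing: tiling decodes the latent in overlapping tiles; slicing decodes a batch one image at a time. Tiling lets you generate above 1024² on small cards. Diffusers: pipe.enable_vae_tiling() and pipe.enable_vae_slicing().
  3. FP8 weights + FP16 compute: stores UNet weights in FP8 and runs the maths in FP16. Halves UNet weight memory (5.2 GB → 2.6 GB) with <1% quality regression on Blackwell.
  4. Attention slicing: computes attention in chunks instead of all at once. Mostly obsoleted by xformers and PyTorch SDPA, but still useful as a fallback. Diffusers: pipe.enable_attention_slicing().
  5. xformers / SDPA: memory-efficient attention kernels. Saves 20–30% peak VRAM and is faster. Diffusers uses PyTorch 2's SDPA by default; xformers is opt-in via pipe.enable_xformers_memory_efficient_attention().
  6. torch.compile: JIT-compiles the UNet. ~15% faster with no extra VRAM, but the first run has a long warm-up.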
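These tricks stack. A minimal sketch of wiring them into an already-loaded Diffusers pipeline, assuming the VRAM thresholds below (illustrative values loosely following the tier table, not tested cut-offs); `apply_memory_savers` is a hypothetical helper, while every `enable_*` call is an existing Diffusers pipeline method. It leaves FP8 and torch.compile out:

```python
def apply_memory_savers(pipe, vram_gb: float, efficient_attention: bool = True):
    """Enable Diffusers memory savers according to available VRAM.

    Thresholds are illustrative assumptions. This helper handles device
    placement itself: do not call pipe.to("cuda") before offloading.
    """
    if vram_gb < 8:
        # Streams submodules to the GPU one at a time: lowest VRAM, slowest.
        pipe.enable_sequential_cpu_offload()
    elif vram_gb < 12:
        # Swaps whole components (text encoders, UNet, VAE) in and out.
        pipe.enable_model_cpu_offload()
    else:
        # Enough room to keep everything resident on the GPU.
        pipe.to("cuda")

    # Decode latents in overlapping tiles, and batches one image at a time.
    pipe.enable_vae_tiling()
    pipe.enable_vae_slicing()

    if not efficient_attention:
        # Fallback only: SDPA (the PyTorch 2 default) and xformers are
        # already memory-efficient, so slicing is unnecessary with them.
        pipe.enable_attention_slicing()
    return pipe
```

With a real pipeline loaded via `StableDiffusionXLPipeline.from_pretrained(..., torch_dtype=torch.float16)`, call `apply_memory_savers(pipe, vram_gb=6)` once before generating. The offload helpers need the `accelerate` package installed.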

Speed by GPU

Wall time depends on step count, resolution and sampler. The Base column uses Euler-A with 30 steps at 1024×1024; the Turbo column uses 4 steps.

| GPU | SDXL Base FP16 | SDXL Turbo (4-step) | Notes |
|---|---|---|---|
| RTX 3050 6 GB | 25 s | 4 s | CPU offload required |
| RTX 3060 12 GB | 11 s | 2 s | Comfortable |
| RTX 5060 Ti 16 GB | 7 s | 1.4 s | Best entry-tier speed |
| RTX 5080 | 5 s | 1.0 s | FP8 path drops to 3.5 s |
| RTX 3090 | 5 s | 1.1 s | 24 GB headroom |
| RTX 5090 | 2.5 s | 0.6 s | FP8 path; batch-friendly |
| RTX 6000 Pro | 2.6 s | 0.6 s | Same speed, more headroom |
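The Base column converts directly into throughput. A minimal sketch, assuming batch size 1 with images generated back to back; the per-step figure folds text encoding and VAE decode into the average, so it slightly overstates the pure UNet step time:

```python
# Seconds per 1024x1024 image for SDXL Base FP16 (30 Euler-A steps),
# copied from the table above.
BASE_FP16_SECONDS = {
    "RTX 3050 6 GB": 25.0,
    "RTX 3060 12 GB": 11.0,
    "RTX 5060 Ti 16 GB": 7.0,
    "RTX 5080": 5.0,
    "RTX 3090": 5.0,
    "RTX 5090": 2.5,
    "RTX 6000 Pro": 2.6,
}
STEPS = 30


def images_per_hour(seconds_per_image: float) -> float:
    """Sequential throughput at batch size 1, ignoring idle gaps."""
    return 3600.0 / seconds_per_image


def seconds_per_step(seconds_per_image: float, steps: int = STEPS) -> float:
    """Average wall time per step, with fixed per-image costs folded in."""
    return seconds_per_image / steps


if __name__ == "__main__":
    for gpu, secs in BASE_FP16_SECONDS.items():
        print(f"{gpu}: {images_per_hour(secs):.0f} img/h, "
              f"{seconds_per_step(secs) * 1000:.0f} ms/step")
```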

Bottom line

For new SDXL deployments on dedicated hardware, take the RTX 5090: fastest per image, FP8 hardware, and 32 GB to keep multiple pipelines hot-loaded. If cost is the constraint, the RTX 3060 12 GB at £99/mo is the cheapest dedicated server we host that runs full SDXL comfortably. See our image generator hosting hub for a deeper deployment guide and our RTX 5090 + Stable Diffusion verdict for the spec-by-spec comparison.
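To weigh the £99/mo RTX 3060 on a per-image basis, divide the monthly price by the number of images the card can produce. A minimal sketch, assuming a 30-day month, the ~11 s per Base image from the speed table, and a utilisation factor you supply (a server is rarely busy all the time); `cost_per_1000_images` is a hypothetical helper and the output is labelled GBP:

```python
def cost_per_1000_images(monthly_price: float, seconds_per_image: float,
                         utilisation: float = 1.0, days: int = 30) -> float:
    """Hardware cost per 1,000 images for a flat-rate monthly server."""
    if not 0.0 < utilisation <= 1.0:
        raise ValueError("utilisation must be in (0, 1]")
    busy_seconds = days * 24 * 3600 * utilisation
    images = busy_seconds / seconds_per_image
    return monthly_price / images * 1000


if __name__ == "__main__":
    # RTX 3060 12 GB: 99 GBP/mo and ~11 s per SDXL Base image (figures above).
    for util in (1.0, 0.25):
        cost = cost_per_1000_images(99, 11, util)
        print(f"{util:.0%} busy: GBP {cost:.2f} per 1,000 images")
```

At 100% utilisation this works out to roughly GBP 0.42 per 1,000 images; at 25% it is four times that.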

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.


gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
