SDXL is the open-weight image model that scales most predictably with VRAM. Its weight memory halves cleanly from FP16 to FP8, it plays well with offloading, and the rendering pipeline (UNet → VAE → optional refiner) has a known VRAM signature you can plan against. This page is the precise sizing reference.
SDXL base needs ~8.5 GB at FP16 in total: ~5 GB for the UNet, ~2 GB for the VAE and text encoders, and ~1.5 GB for activations and buffers. With LoRAs and a single ControlNet you’re at ~12 GB. With the refiner ensemble, ~16 GB. So:
- 6 GB cards — works with offloading, slow
- 8 GB — comfortable base SDXL
- 12 GB — base + LoRAs + 1 ControlNet
- 16 GB — full refiner ensemble
- 24 GB+ — multi-pipeline, batch generation
SDXL base model VRAM
| Component | FP32 | FP16 | FP8 | INT8 |
|---|---|---|---|---|
| UNet (2.6B params) | 10.4 GB | 5.2 GB | 2.6 GB | 2.6 GB |
| VAE | 0.4 GB | 0.2 GB | 0.1 GB | 0.1 GB |
| Text encoder 1 (CLIP-L) | 0.5 GB | 0.25 GB | 0.13 GB | 0.13 GB |
| Text encoder 2 (OpenCLIP-G) | 2.8 GB | 1.4 GB | 0.7 GB | 0.7 GB |
| Activations + buffers | ~2 GB | ~1.5 GB | ~1.5 GB | ~1.5 GB |
| Total (1024×1024) | ~16 GB | ~8.5 GB | ~5 GB | ~5 GB |
Almost nobody runs SDXL at FP32 in production — FP16 is the default. FP8 is supported on Blackwell (5080/5090/6000 Pro) via TensorRT-LLM-style quantisation; quality drop is <1%.
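The weight rows in the table above follow from parameter count times bytes per parameter. The sketch below reproduces the UNet row under two assumptions: the 2.6B parameter figure from the table, and decimal gigabytes (1 GB = 1e9 bytes). The function name `weights_gb` is illustrative, and activations are not modelled.

```python
# Bytes stored per parameter at each precision (FP8 and INT8 both use one byte).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1, "int8": 1}


def weights_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB, assuming 1 GB = 1e9 bytes."""
    # (params_billion * 1e9 params) * bytes / 1e9 simplifies to this product.
    return params_billion * BYTES_PER_PARAM[precision]


# UNet row of the table: 2.6B parameters.
for precision in ("fp32", "fp16", "fp8"):
    print(precision, round(weights_gb(2.6, precision), 1), "GB")
```

The same function applied to the text encoders and VAE gives the other weight rows; the activation row has to be measured rather than derived.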
SDXL variants — Turbo, Lightning, refiner
- SDXL Base — the original. 25–50 sampling steps. ~8 GB FP16.
- SDXL Refiner — second-pass model that adds detail. ~5 GB FP16. Add to a Base pipeline → ~13 GB total.
- SDXL Turbo — distilled for 1–4 step generation. Same VRAM as Base; just faster.
- SDXL Lightning — LCM-style 2/4/8-step distilled model. Same VRAM.
- Hyper-SDXL — 1-step variant, same VRAM.
ControlNets, IP-Adapters, LoRAs
| Add-on | VRAM cost | Notes |
|---|---|---|
| LoRA (single, rank 32-128) | ~50–200 MB | Trivial. Hot-load 8–10 LoRAs on a 12 GB card. |
| LoRA (XL trained, rank 256+) | ~400 MB | Larger; trim if VRAM-tight. |
| ControlNet (one) | ~2.5 GB FP16 / 1.3 GB FP8 | Each adds an additional UNet pass. |
| IP-Adapter | ~0.5 GB | Cheap. Hot-load several. |
| Inpainting model | ~1 GB additional | On top of base. |
| Refiner | ~5 GB FP16 | Takes a Base pipeline from ~8 GB to ~13 GB at peak. |
Concrete example: SDXL Base + 1 ControlNet + 2 LoRAs + IP-Adapter at FP16 = ~12 GB peak. Comfortable on a 3090 24 GB; tight on a 5080 16 GB.
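The concrete example can be checked with a simple additive budget. This is a rough sketch, assuming the FP16 figures from the tables on this page and that add-on costs simply sum; real peaks also depend on resolution and batch size. The names `ADDON_FP16_GB` and `peak_vram_gb` are illustrative.

```python
# FP16 figures taken from the tables on this page, in GB.
BASE_FP16_GB = 8.5
ADDON_FP16_GB = {
    "controlnet": 2.5,
    "lora": 0.2,       # upper end of the 50-200 MB range
    "lora_xl": 0.4,
    "ip_adapter": 0.5,
    "inpainting": 1.0,
    "refiner": 5.0,
}


def peak_vram_gb(addons: dict[str, int]) -> float:
    """Base SDXL plus `count` copies of each named add-on (additive estimate)."""
    extra = sum(ADDON_FP16_GB[name] * count for name, count in addons.items())
    return BASE_FP16_GB + extra


# Base + 1 ControlNet + 2 LoRAs + IP-Adapter, as in the example above.
example = {"controlnet": 1, "lora": 2, "ip_adapter": 1}
print(f"{peak_vram_gb(example):.1f} GB")
```

For this configuration the estimate comes to about 11.9 GB, in line with the ~12 GB quoted.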
GPU recommendations by tier
| Tier | GPU | What it handles |
|---|---|---|
| Minimum | RTX 3050 6 GB | SDXL FP16 with sequential CPU offload only. ~25 s per 1024×1024 image. |
| Entry | RTX 4060 8 GB | SDXL Base FP16 fits. No room for ControlNets at FP16. |
| Comfortable | RTX 3060 12 GB | Base + LoRAs + 1 ControlNet. The first card we recommend. |
| Sweet spot | RTX 5080 16 GB | Base + LoRAs + 2 ControlNets. FP8 path. Fast. |
| Production | RTX 5090 32 GB | Full ensemble + batch + multiple pipelines hot-loaded. |
| Workstation | RTX 6000 Pro 96 GB | Multi-model serving (SDXL + FLUX + SD3). |
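If you already have a VRAM estimate for your pipeline, the tier table reduces to a lookup for the smallest card that covers it. A minimal sketch, assuming the card capacities from the table; `smallest_card` and its `headroom_gb` parameter are hypothetical helpers, not part of any library.

```python
# (GPU name, VRAM in GB) from the tier table, smallest first.
TIERS = [
    ("RTX 3050", 6),
    ("RTX 4060", 8),
    ("RTX 3060", 12),
    ("RTX 5080", 16),
    ("RTX 5090", 32),
    ("RTX 6000 Pro", 96),
]


def smallest_card(required_gb: float, headroom_gb: float = 0.0):
    """Return the first tier whose VRAM covers the requirement, or None."""
    needed = required_gb + headroom_gb
    for name, vram_gb in TIERS:
        if vram_gb >= needed:
            return name
    return None  # nothing in the table is large enough


print(smallest_card(12))  # Base + LoRAs + 1 ControlNet
print(smallest_card(13))  # Base + Refiner
```

Setting `headroom_gb` above zero is a reasonable choice if the card also drives a display or other processes share it.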
Memory-saving tricks that actually work
- Sequential CPU offload — moves text encoders + VAE to CPU between forward passes. Cuts VRAM to ~5 GB. Costs ~30% in latency. Diffusers: pipe.enable_sequential_cpu_offload().
- VAE tiling + slicing — decodes the latent in tiles. Lets you generate >1024² on small cards. Diffusers: pipe.enable_vae_tiling().
- FP8 weights + FP16 cache — halves UNet VRAM with <1% quality regression on Blackwell.
- Attention slicing — computes attention in chunks rather than all at once. Mostly obsoleted by xformers, but still useful as a fallback.
- xformers / SDPA — efficient attention kernels. Saves 20–30% peak VRAM and is faster.
- torch.compile — JIT-compiled UNet. ~15% faster, no VRAM cost. Long warm-up though.
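The first two tricks in the list above map to Diffusers pipeline methods. The sketch below shows one way to switch them on according to available VRAM. The 16 GB and 8 GB thresholds, the helper names, and the use of `enable_vae_slicing()` alongside tiling are assumptions for illustration, as is the Hugging Face model ID; check the method names against your installed Diffusers version.

```python
def apply_memory_savers(pipe, free_vram_gb: float):
    """Enable more aggressive savers as VRAM shrinks; returns what was enabled."""
    enabled = []
    # VAE tiling and slicing are cheap, so enable them on anything under 16 GB.
    if free_vram_gb < 16:
        pipe.enable_vae_tiling()
        pipe.enable_vae_slicing()
        enabled += ["vae_tiling", "vae_slicing"]
    # Sequential offload costs latency, so reserve it for cards under 8 GB.
    if free_vram_gb < 8:
        pipe.enable_sequential_cpu_offload()
        enabled.append("sequential_cpu_offload")
    return enabled


def load_sdxl_fp16():
    """Load SDXL Base at FP16 (downloads weights on first use)."""
    # Imported here so the helper above works without torch/diffusers installed.
    import torch
    from diffusers import StableDiffusionXLPipeline

    return StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True,
    )
```

Typical use would be `pipe = load_sdxl_fp16()` followed by `apply_memory_savers(pipe, 6)`. When sequential offload is enabled, do not also call `pipe.to("cuda")`, since the offload hooks manage device placement themselves.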
Speed by GPU
Steps × resolution × sampler combine to determine wall time. Reference numbers below use Euler-A with 30 steps at 1024×1024.
| GPU | SDXL Base FP16 | SDXL Turbo (4-step) | Notes |
|---|---|---|---|
| RTX 3050 6 GB | 25 s | 4 s | CPU offload required |
| RTX 3060 12 GB | 11 s | 2 s | Comfortable |
| RTX 5060 Ti 16 GB | 7 s | 1.4 s | Best entry-tier speed |
| RTX 5080 | 5 s | 1.0 s | FP8 path drops to 3.5 s |
| RTX 3090 | 5 s | 1.1 s | 24 GB headroom |
| RTX 5090 | 2.5 s | 0.6 s | FP8 path; batch-friendly |
| RTX 6000 Pro | 2.6 s | 0.6 s | Same speed, more headroom |
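Per-image latency converts directly into throughput for capacity planning. A minimal sketch, assuming the SDXL Base FP16 column of the table, one image at a time, and no queueing or model-loading overhead; `images_per_hour` is an illustrative name.

```python
# Seconds per 1024x1024 image (SDXL Base FP16, 30 steps) from the table.
SECONDS_PER_IMAGE = {
    "RTX 3050": 25.0,
    "RTX 3060": 11.0,
    "RTX 5060 Ti": 7.0,
    "RTX 5080": 5.0,
    "RTX 3090": 5.0,
    "RTX 5090": 2.5,
    "RTX 6000 Pro": 2.6,
}


def images_per_hour(gpu: str) -> float:
    """Sequential throughput: 3600 seconds divided by seconds per image."""
    return 3600.0 / SECONDS_PER_IMAGE[gpu]


for gpu in ("RTX 3060", "RTX 5090"):
    print(gpu, round(images_per_hour(gpu)), "images/hour")
```

Batched generation on the larger cards would likely beat these figures, so treat them as a floor for sequential serving.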
Bottom line
For new SDXL deployments on dedicated hardware, take the RTX 5090: it is the fastest per image, has FP8 hardware, and its 32 GB lets you keep multiple pipelines hot-loaded. If cost is the constraint, the RTX 3060 12 GB at £99/mo is the cheapest dedicated server we host that comfortably runs SDXL Base with LoRAs and a ControlNet. See our image generator hosting hub for a deeper deployment guide and our RTX 5090 + Stable Diffusion verdict for the spec-by-spec breakdown.