SDXL 1.0 remains the most widely deployed high-resolution diffusion model in production. The base UNet is 2.6 billion parameters in FP16, small by 2026 standards, but it pairs with two text encoders and produces native 1024-pixel output, a pipeline heavy enough that latency on lesser cards is painful. On an RTX 4090 24GB dedicated host from Gigagpu, a single 1024×1024 image at 30 steps with the DPM++ 2M Karras sampler renders in 2.0 seconds. A batch of four completes in 6.5 seconds, or 1.63 seconds per image: the natural sweet spot for studio-grade serving, with comfortable VRAM headroom left for ControlNet and a stack of LoRAs.
Contents
- Methodology and test rig
- Single-image latency by resolution
- Batched throughput and VRAM
- VRAM map and the bandwidth ceiling
- Distilled variants: Lightning, Turbo, LCM
- Refiner, LoRA and ControlNet stacks
- Cross-card comparison
- Production gotchas
- Verdict: when to pick SDXL
Methodology and test rig
All measurements use Diffusers 0.30 with PyTorch 2.5, sdpa attention via F.scaled_dot_product_attention, on a stock RTX 4090 24GB Founders Edition at 450W TDP. Host is a Ryzen 9 7950X with 64GB DDR5-5600 and a Samsung 990 Pro 2TB Gen 4 NVMe; OS is Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6. Each result is the median of fifty runs after a five-image warm-up; standard deviation stayed under 3% in every configuration.
Wall-clock latency includes text encoding and VAE decode but excludes any network or PIL save time. Batch numbers reflect a single forward pass producing N images simultaneously, not N sequential calls. Where we use torch.compile, the 8-12% gain it produces is reported in the configuration section but excluded from the headline numbers, since first-call compilation latency makes it inappropriate for cold endpoints.
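The timing protocol above can be sketched as a small harness. This is a hypothetical helper, not the article's actual rig code; `sync` should be `torch.cuda.synchronize` on a GPU host so the queue is drained before and after each timed call.

```python
import statistics
import time

def bench(pipe_call, warmup=5, runs=50, sync=None):
    """Median wall-clock latency over `runs` calls after `warmup`
    throwaway calls. `sync` is an optional callable (e.g.
    torch.cuda.synchronize) run before and after each timed call so
    async GPU work is fully accounted for."""
    for _ in range(warmup):
        pipe_call()
    times = []
    for _ in range(runs):
        if sync:
            sync()
        start = time.perf_counter()
        pipe_call()
        if sync:
            sync()
        times.append(time.perf_counter() - start)
    return statistics.median(times)
```

Usage would look like `bench(lambda: pipe(prompt, num_inference_steps=30), sync=torch.cuda.synchronize)`.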
Single-image latency by resolution
SDXL’s quadratic attention scaling means resolution dominates latency once you cross the 1024 threshold. The numbers below cover the practical native-trained resolutions plus the common 1.5x landscape variants:
| Resolution | Steps | Sampler | Latency | Steps/s | VRAM |
|---|---|---|---|---|---|
| 1024 × 1024 | 20 | DPM++ 2M Karras | 1.4 s | 14.3 | 9.0 GB |
| 1024 × 1024 | 30 | DPM++ 2M Karras | 2.0 s | 15.0 | 9.0 GB |
| 1024 × 1024 | 50 | Euler a | 3.3 s | 15.2 | 9.0 GB |
| 1216 × 832 | 30 | DPM++ 2M Karras | 2.1 s | 14.3 | 9.4 GB |
| 1344 × 768 | 30 | DPM++ 2M Karras | 2.0 s | 15.0 | 9.3 GB |
| 1536 × 1024 | 30 | DPM++ 2M Karras | 3.1 s | 9.7 | 11.2 GB |
| 2048 × 2048 (HiRes fix) | 30+15 | DPM++ 2M / Latent | 9.6 s | — | 17.8 GB |
Note the cliff at 1536×1024: per-step time jumps from 67 ms to 103 ms because the UNet's attention now operates on 1.5x as many latent tokens, and self-attention cost grows with the square of the token count, roughly 2.25x. Above 1536 you should be running HiRes-fix or img2img upscaling rather than native generation.
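The token arithmetic behind that cliff is easy to check. A sketch, assuming SDXL's VAE downsamples by 8x per side and one attention token per latent pixel:

```python
def latent_tokens(width, height, vae_factor=8):
    """Latent grid size the UNet's attention layers see after the
    VAE's 8x spatial downsampling."""
    return (width // vae_factor) * (height // vae_factor)

base = latent_tokens(1024, 1024)   # 16384 tokens
wide = latent_tokens(1536, 1024)   # 24576 tokens, 1.5x more
attn_ratio = (wide / base) ** 2    # self-attention cost scales ~quadratically: 2.25x
```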
Batched throughput and VRAM
Batching is the single highest-leverage optimisation for SDXL serving. The UNet weights stay resident; only activations and KV grow with batch. The 4090 has enough compute to push aggregate steps/s up to a practical ceiling around batch 4-6:
| Batch | Latency (30 steps) | s/image | Aggregate img/min | VRAM peak |
|---|---|---|---|---|
| 1 | 2.0 s | 2.00 | 30 | 9.0 GB |
| 2 | 3.4 s | 1.70 | 35 | 11.5 GB |
| 4 | 6.5 s | 1.63 | 37 | 16.0 GB |
| 6 | 9.4 s | 1.57 | 38 | 22.5 GB (tight) |
| 8 @ 1024 | OOM | — | — | >24 GB |
| 8 @ 896 | 10.8 s | 1.35 | 44 | 20.5 GB |
Batch 4 at 1024 is the production default. Batch 6 squeezes a fraction more throughput but leaves no headroom for ControlNet or text-encoder peaks. If you really need batch 8, drop to 896×896 (still above the 768 native trained resolution) or enable VAE tiling, which adds about 80 ms of latency but caps VAE peak around 2 GB.
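A minimal batched-serving sketch under those assumptions. The model id is the standard Hugging Face repo; the heavy imports are deferred inside the builder so the helpers stay importable on a machine without a GPU stack:

```python
def make_pipe(device="cuda"):
    """Build the SDXL pipeline once; weights stay resident across calls."""
    import torch
    from diffusers import StableDiffusionXLPipeline
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16, variant="fp16",
    ).to(device)
    pipe.enable_vae_tiling()  # ~80 ms extra, caps the VAE decode peak near 2 GB
    return pipe

def generate_batch(pipe, prompt, batch=4, steps=30, side=1024):
    """One forward pass producing `batch` images simultaneously,
    not `batch` sequential calls."""
    return pipe(prompt=prompt, num_inference_steps=steps,
                height=side, width=side,
                num_images_per_prompt=batch).images

def images_per_minute(batch, latency_s):
    """Aggregate throughput implied by a single batch latency."""
    return 60.0 * batch / latency_s
```

At the measured latencies, `images_per_minute(4, 6.5)` reproduces the table's 37 img/min.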
VRAM map and the bandwidth ceiling
The 4090’s 24GB GDDR6X delivers 1008 GB/s of memory bandwidth and 16,384 CUDA cores at 2.5 GHz, which is enough that SDXL’s UNet is compute-bound, not memory-bound, at native resolutions. That changes the optimisation calculus relative to LLM workloads:
| Component | FP16 bytes | Notes |
|---|---|---|
| UNet weights | 5.0 GB | 2.6B params, FP16 |
| VAE weights | 0.4 GB | FP16; FP32 needed for fp16 colour bug workaround |
| CLIP-L + OpenCLIP-G | 1.7 GB | Resident; can offload between calls |
| Activations (1024 px, b=1) | 2.0 GB | Quadratic in resolution, linear in batch |
| CUDA scratch + workspace | 0.4 GB | FlashAttention buffers |
| VAE decode peak | 1.8 GB | Transient; VAE tiling caps at ~2 GB |
The math: each UNet step at 1024 touches roughly 5 GB of weights. At 1008 GB/s that is 5 ms of memory work, but the actual step takes 67 ms — the 4090 is doing real arithmetic, not just shuffling bytes. This is why FP8 quantisation of SDXL gives marginal wins (~10%) compared to the 40-50% wins it produces on bandwidth-bound LLM decode.
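The back-of-envelope above in code form, using the article's own figures:

```python
def memory_floor_ms(weights_gb, bandwidth_gb_s):
    """Lower bound on step time if the GPU only had to stream the
    UNet weights once per step, with zero arithmetic."""
    return 1000.0 * weights_gb / bandwidth_gb_s

floor_ms = memory_floor_ms(5.0, 1008)         # ~5 ms of pure memory traffic
measured_ms = 67.0                            # observed per-step time at 1024px
compute_share = 1.0 - floor_ms / measured_ms  # ~0.93: the step is compute-bound
```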
Distilled variants: Lightning, Turbo, LCM
For latency-critical paths, distilled SDXL variants reduce step counts dramatically. Quality drops are real but often acceptable for previews, thumbnails, or interactive sliders:
| Variant | Steps | Latency b=1 | Latency b=4 | VRAM | Use case |
|---|---|---|---|---|---|
| SDXL base | 30 | 2.0 s | 6.5 s | 9.0 GB | Final renders |
| SDXL Lightning 4-step | 4 | 0.7 s | 1.6 s | 9.0 GB | Previews, drafts |
| SDXL Turbo 4-step | 4 | 0.55 s | 1.4 s | 9.0 GB | Realtime sliders |
| SDXL LCM 8-step | 8 | 1.0 s | 2.5 s | 9.2 GB | Mid-fidelity |
| SDXL Hyper 1-step | 1 | 0.18 s | 0.42 s | 9.0 GB | Thumbnail grids |
SDXL Turbo at 4 steps and 0.55 seconds gives you 109 images per minute single-stream, fast enough for interactive design tools. The full SDXL setup guide documents how to swap between base and distilled checkpoints behind a single endpoint for quality/speed tiering.
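One way to implement that quality/speed tiering is a small registry mapping tiers to checkpoints and sampling settings. Repo ids and values here are illustrative, not a prescribed layout; distilled checkpoints are trained to run with classifier-free guidance disabled:

```python
# tier -> (checkpoint repo, steps, guidance scale); illustrative values
TIERS = {
    "final":    ("stabilityai/stable-diffusion-xl-base-1.0", 30, 7.0),
    "preview":  ("ByteDance/SDXL-Lightning", 4, 0.0),
    "realtime": ("stabilityai/sdxl-turbo", 4, 0.0),
}

def tier_settings(tier):
    """Checkpoint repo plus pipeline call kwargs for one serving tier."""
    repo, steps, cfg = TIERS[tier]
    return repo, {"num_inference_steps": steps, "guidance_scale": cfg}
```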
Refiner, LoRA and ControlNet stacks
Real production pipelines rarely run base-only. Adding refiner, LoRAs and ControlNet to the same 4090:
| Stack | Latency b=1 | VRAM | Notes |
|---|---|---|---|
| SDXL base | 2.0 s | 9.0 GB | Baseline |
| + 4 LoRAs | 2.05 s | 10.0 GB | PEFT adapter merging, no per-step cost |
| + ControlNet (Canny) | 2.6 s | 11.5 GB | +30% per step, +1.5 GB |
| + ControlNet + IP-Adapter | 2.9 s | 13.0 GB | Two conditioning paths |
| + Refiner (10 steps) | 2.7 s | 11.0 GB | Adds 0.7s, 1.4 GB |
| Full stack: base + refiner + 4 LoRAs + 1 ControlNet | 3.6 s | 14.5 GB | Studio default |
LoRAs are nearly free per-image because PEFT merges the low-rank deltas into the UNet weights at load time. The cost is one-off load latency (~150 ms per LoRA) and a small VRAM bump. ControlNet, by contrast, runs a parallel 1.3B network at every step — the +30% latency is unavoidable.
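With Diffusers' PEFT integration, that merge-at-load pattern looks roughly like this. A sketch, assuming `stack` is a list of `(repo, scale)` pairs in your documented load order:

```python
def load_lora_stack(pipe, stack):
    """Load LoRAs in a fixed, documented order (ordering matters, see
    gotchas), set per-adapter scales, then fuse the low-rank deltas
    into the UNet weights so there is no per-step cost."""
    names, scales = [], []
    for i, (repo, scale) in enumerate(stack):
        name = f"lora_{i}"
        pipe.load_lora_weights(repo, adapter_name=name)
        names.append(name)
        scales.append(scale)
    pipe.set_adapters(names, adapter_weights=scales)
    pipe.fuse_lora()
    return pipe
```

`fuse_lora()` folds the scaled deltas into the base weights; call `unfuse_lora()` before swapping in a different stack.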
Cross-card comparison
SDXL 1024×1024 at 30 steps DPM++ 2M Karras, batch 1:
| GPU | VRAM | Bandwidth | SDXL b=1 | SDXL b=4 | Distilled 4-step |
|---|---|---|---|---|---|
| RTX 5060 Ti 16GB | 16 | 448 GB/s | 4.8 s | 16.5 s | 1.6 s |
| RTX 5080 16GB | 16 | 960 GB/s | 2.4 s | 8.1 s | 0.85 s |
| RTX 3090 24GB | 24 | 936 GB/s | 2.7 s | 9.2 s | 0.95 s |
| RTX 4090 24GB | 24 | 1008 GB/s | 2.0 s | 6.5 s | 0.7 s |
| RTX 5090 32GB | 32 | 1792 GB/s | 1.2 s | 3.8 s | 0.42 s |
| H100 80GB | 80 | 3350 GB/s | 1.0 s | 3.1 s | 0.36 s |
The 4090 sits in the price/perf sweet spot. The 5090 is faster but pricier per hour; the H100’s bandwidth advantage is wasted on a compute-bound workload. The 5060 Ti works for hobby use but multi-batch SDXL serving on it stops being economic above a few users.
Production gotchas
- VAE FP16 colour bug. The default SDXL VAE produces saturated artefacts in FP16 on certain images. Either use the `madebyollin/sdxl-vae-fp16-fix` checkpoint or run the VAE in FP32, which costs ~150 ms per decode.
- torch.compile cold start. First call after compile takes 90-120 seconds. Pre-warm before opening to traffic; persist the compiled cache between restarts via `torch._inductor.config.fx_graph_cache = True`.
- OpenCLIP-G memory spikes. The text encoder briefly allocates 3.4 GB during its forward pass. If you're already at 22 GB UNet+activation, you'll OOM on text encode. Encode prompts first, free the encoders, then run the UNet.
- ControlNet preprocessor on CPU is the bottleneck. Canny, depth and openpose preprocessing in PIL/OpenCV on CPU often takes 200-400 ms — longer than the actual generation. Move preprocessors to GPU (controlnet-aux supports this) or run them async.
- LoRA scale stacking is not commutative. Loading LoRA A at scale 0.8 then B at 0.6 produces different output than loading B then A. Always load in deterministic order and document the stack.
- Refiner needs the same noise schedule as base. Switching samplers between base and refiner produces seam artefacts at the handoff point. Stick to DPM++ 2M Karras for both.
- Don’t stream Diffusers callbacks over network. The intermediate latent decode adds 30-50 ms per callback; stream only every 5th step or you waste 20% of your latency budget on previews.
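The encode-first pattern from the OpenCLIP-G gotcha can be sketched as below. A hypothetical helper: on a real host you would pass `torch.cuda.empty_cache` as `empty_cache` so freed blocks go back to the allocator.

```python
import gc

def encode_then_free(pipe, prompts, device="cuda", empty_cache=None):
    """Run both text encoders up front, keep only the embeddings, and
    move the encoders to CPU before the UNet loop, so the transient
    ~3.4 GB OpenCLIP-G forward peak never overlaps UNet activations."""
    embeds = [pipe.encode_prompt(p, device=device, num_images_per_prompt=1)
              for p in prompts]
    pipe.text_encoder.to("cpu")
    pipe.text_encoder_2.to("cpu")
    gc.collect()
    if empty_cache:
        empty_cache()
    return embeds
```

Each entry of `embeds` holds the prompt and pooled embeddings, which go back into the pipeline via `prompt_embeds=...` and `pooled_prompt_embeds=...` instead of a raw prompt string.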
Verdict: when to pick SDXL on a 4090
SDXL on the 4090 is the workhorse choice for any production image pipeline that needs 1024-pixel native output, ControlNet support, and a mature LoRA ecosystem. At 1.63 seconds per image batched, you serve 2,200 images per hour from one card with comfortable headroom for stacking. Pick it over FLUX.1-dev when speed matters more than fidelity, over FLUX schnell when you need ControlNet (which the FLUX ecosystem is still catching up on), and over SD 1.5 whenever you’re outputting above 768 pixels. Move to the 5090 only when you’ve maxed out a 4090 and need batch 8 at 1024 native — and check best GPU for Stable Diffusion for the broader landscape first.
SDXL at 1.6 seconds per image
Batched 1024-pixel generation on UK 4090 hosts. 2,200 images per hour, ControlNet and LoRA stacks fit comfortably in 24GB.
Order the RTX 4090 24GB

See also: Stable Diffusion setup, ComfyUI setup, FLUX schnell benchmark, FLUX dev benchmark, Stable Video Diffusion, image studio use case, 5060 Ti SDXL comparison, best GPU for SD.