
RTX 4090 24GB SDXL Benchmark: 1024×1024 in 2.0s

SDXL 1.0 at 1024×1024 on the RTX 4090 24GB renders a 30-step image in 2.0 seconds, a batch of four in 6.5 s. Per-resolution and per-batch tables, distilled variants, ControlNet stacks, cross-card comparison and production gotchas.

SDXL 1.0 remains the most-deployed high-resolution diffusion model in production. The base UNet is 2.6 billion parameters (5 GB in FP16), small by 2026 standards, but it pairs with two text encoders and produces native 1024-pixel output, which keeps latency painful on lesser cards. On an RTX 4090 24GB dedicated host from Gigagpu, a single 1024×1024 image at 30 steps with the DPM++ 2M Karras sampler renders in 2.0 seconds. A batch of four completes in 6.5 seconds, which is 1.63 seconds per image: the natural sweet spot for studio-grade serving, with comfortable VRAM headroom for ControlNet and a stack of LoRAs.

Contents

  • Methodology and test rig
  • Single-image latency by resolution
  • Batched throughput and VRAM
  • VRAM map and the bandwidth ceiling
  • Distilled variants: Lightning, Turbo, LCM
  • Refiner, LoRA and ControlNet stacks
  • Cross-card comparison
  • Production gotchas
  • Verdict: when to pick SDXL on a 4090

Methodology and test rig

All measurements use Diffusers 0.30 with PyTorch 2.5 and SDPA attention (F.scaled_dot_product_attention), on a stock RTX 4090 24GB Founders Edition at 450 W TDP. The host is a Ryzen 9 7950X with 64GB DDR5-5600 and a Samsung 990 Pro 2TB Gen 4 NVMe; the OS is Ubuntu 24.04 LTS with NVIDIA driver 560.x and CUDA 12.6. Each result is the median of fifty runs after a five-image warm-up; standard deviation stayed under 3% across runs.

Wall-clock latency includes text encoding and VAE decode but excludes any network or PIL save time. Batch numbers reflect a single forward pass producing N images simultaneously, not N sequential calls. Where we use torch.compile, the 8-12% gain it produces is reported in the configuration section but excluded from the headline numbers, since first-call compilation latency makes it inappropriate for cold endpoints.
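
A minimal sketch of the harness under those rules, assuming the public SDXL base checkpoint; the prompt is illustrative:

```python
import statistics
import time

import torch
from diffusers import DPMSolverMultistepScheduler, StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
# DPM++ 2M Karras = multistep DPM-Solver with Karras sigmas
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)

def run_once() -> float:
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipe(prompt="a lighthouse at dusk", height=1024, width=1024,
         num_inference_steps=30)
    torch.cuda.synchronize()           # include VAE decode, exclude save
    return time.perf_counter() - start

for _ in range(5):                     # five-image warm-up, discarded
    run_once()
timings = [run_once() for _ in range(50)]
print(f"median: {statistics.median(timings):.2f} s")
```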

Single-image latency by resolution

SDXL’s quadratic attention scaling means resolution dominates latency once you cross the 1024 threshold. The numbers below cover the practical native-trained resolutions plus the common widescreen aspect-ratio buckets:

Resolution | Steps | Sampler | Latency | Steps/s | VRAM
1024 × 1024 | 20 | DPM++ 2M Karras | 1.4 s | 14.3 | 9.0 GB
1024 × 1024 | 30 | DPM++ 2M Karras | 2.0 s | 15.0 | 9.0 GB
1024 × 1024 | 50 | Euler a | 3.3 s | 15.2 | 9.0 GB
1216 × 832 | 30 | DPM++ 2M Karras | 2.1 s | 14.3 | 9.4 GB
1344 × 768 | 30 | DPM++ 2M Karras | 2.0 s | 15.0 | 9.3 GB
1536 × 1024 | 30 | DPM++ 2M Karras | 3.1 s | 9.7 | 11.2 GB
2048 × 2048 (HiRes fix) | 30+15 | DPM++ 2M / Latent | 9.6 s | n/a | 17.8 GB

Note the cliff at 1536×1024: per-step time jumps from 67 ms to 103 ms because the latent grid grows to 1.5x as many tokens and attention cost scales with the square of token count. Above 1536 you should be running HiRes-fix or img2img upscaling rather than native generation, as sketched below.
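
For the 2048 row, a hedged sketch of the two-stage pattern: image-space upscale plus low-strength img2img, one common approximation of HiRes fix. from_pipe (available in recent Diffusers releases) shares the already-loaded weights; the prompt is illustrative:

```python
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline, StableDiffusionXLPipeline

prompt = "aerial photo of a coastal town, golden hour"

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
img2img = StableDiffusionXLImg2ImgPipeline.from_pipe(base)  # reuses weights

# Stage 1: native 1024 generation (30 steps).
image = base(prompt, height=1024, width=1024, num_inference_steps=30).images[0]

# Stage 2: upscale, then denoise lightly. In img2img the actual step count
# is strength * num_inference_steps, so 0.3 * 50 ≈ 15 refinement steps.
upscaled = image.resize((2048, 2048))  # swap in Lanczos/ESRGAN as preferred
final = img2img(prompt, image=upscaled, strength=0.3,
                num_inference_steps=50).images[0]
```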

Batched throughput and VRAM

Batching is the single highest-leverage optimisation for SDXL serving. The UNet weights stay resident; only activations and KV grow with batch. The 4090 has enough compute to push aggregate steps/s up to a practical ceiling around batch 4-6:

Batch | Latency (30 steps) | s/image | Aggregate img/min | VRAM peak
1 | 2.0 s | 2.00 | 30 | 9.0 GB
2 | 3.4 s | 1.70 | 35 | 11.5 GB
4 | 6.5 s | 1.63 | 37 | 16.0 GB
6 | 9.4 s | 1.57 | 38 | 22.5 GB (tight)
8 @ 1024 | OOM | n/a | n/a | >24 GB
8 @ 896 | 10.8 s | 1.35 | 44 | 20.5 GB

Batch 4 at 1024 is the production default. Batch 6 squeezes out a fraction more throughput but leaves no headroom for ControlNet or text-encoder peaks. If you really need batch 8, drop to 896×896 (still above the 768 px minimum of SDXL's training buckets) or enable VAE tiling, which adds about 80 ms of latency but caps the VAE peak around 2 GB.
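
A sketch of the batch-4 default with VAE tiling enabled, assuming the same base checkpoint as above:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.enable_vae_tiling()  # ~80 ms extra, caps the VAE decode peak near 2 GB

# One forward pass producing four latents, not four sequential calls.
images = pipe(
    prompt="product photo of a ceramic mug, studio lighting",
    height=1024,
    width=1024,
    num_inference_steps=30,
    num_images_per_prompt=4,
).images
```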

VRAM map and the bandwidth ceiling

The 4090’s 24GB GDDR6X delivers 1008 GB/s of memory bandwidth and 16,384 CUDA cores at 2.5 GHz, which is enough that SDXL’s UNet is compute-bound, not memory-bound, at native resolutions. That changes the optimisation calculus relative to LLM workloads:

Component | Size (FP16) | Notes
UNet weights | 5.0 GB | 2.6B params, FP16
VAE weights | 0.4 GB | FP16; FP32 needed for the fp16 colour bug workaround
CLIP-L + OpenCLIP-G | 1.7 GB | Resident; can offload between calls
Activations (1024 px, b=1) | 2.0 GB | Quadratic in resolution, linear in batch
CUDA scratch + workspace | 0.4 GB | FlashAttention buffers
VAE decode peak | 1.8 GB | Transient; VAE tiling caps it at ~2 GB

The math: each UNet step at 1024 touches roughly 5 GB of weights. At 1008 GB/s that is 5 ms of memory work, but the actual step takes 67 ms — the 4090 is doing real arithmetic, not just shuffling bytes. This is why FP8 quantisation of SDXL gives marginal wins (~10%) compared to the 40-50% wins it produces on bandwidth-bound LLM decode.
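
The same check as a two-line script, using the numbers from the tables above:

```python
# Roofline sanity check: weight traffic alone would take ~5 ms per step,
# an order of magnitude under the 67 ms observed, so compute dominates.
weights_gb, bandwidth_gbs, observed_ms = 5.0, 1008, 67
floor_ms = weights_gb / bandwidth_gbs * 1e3
print(f"memory floor {floor_ms:.1f} ms vs observed {observed_ms} ms")  # ~5.0 vs 67
```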

Distilled variants: Lightning, Turbo, LCM

For latency-critical paths, distilled SDXL variants reduce step counts dramatically. Quality drops are real but often acceptable for previews, thumbnails, or interactive sliders:

Variant | Steps | Latency b=1 | Latency b=4 | VRAM | Use case
SDXL base | 30 | 2.0 s | 6.5 s | 9.0 GB | Final renders
SDXL Lightning 4-step | 4 | 0.7 s | 1.6 s | 9.0 GB | Previews, drafts
SDXL Turbo 4-step | 4 | 0.55 s | 1.4 s | 9.0 GB | Realtime sliders
SDXL LCM 8-step | 8 | 1.0 s | 2.5 s | 9.2 GB | Mid-fidelity
SDXL Hyper 1-step | 1 | 0.18 s | 0.42 s | 9.0 GB | Thumbnail grids

SDXL Turbo at 4 steps and 0.55 seconds gives you 109 images per minute single-stream, fast enough for interactive design tools. The full SDXL setup guide documents how to swap between base and distilled checkpoints behind a single endpoint for quality/speed tiering.
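
A sketch of the fast tier using the public stabilityai/sdxl-turbo checkpoint, following its model-card usage; the routing around it belongs to your serving layer:

```python
import torch
from diffusers import AutoPipelineForText2Image

turbo = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# Turbo is distilled to run without classifier-free guidance, hence
# guidance_scale=0.0. It was distilled at 512 px; pass height/width
# explicitly if you need larger output.
preview = turbo(
    prompt="isometric voxel castle",
    num_inference_steps=4,
    guidance_scale=0.0,
).images[0]
```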

Refiner, LoRA and ControlNet stacks

Real production pipelines rarely run base-only. Adding refiner, LoRAs and ControlNet to the same 4090:

Stack | Latency b=1 | VRAM | Notes
SDXL base | 2.0 s | 9.0 GB | Baseline
+ 4 LoRAs | 2.05 s | 10.0 GB | PEFT adapter merging, no per-step cost
+ ControlNet (Canny) | 2.6 s | 11.5 GB | +30% per step, +1.5 GB
+ ControlNet + IP-Adapter | 2.9 s | 13.0 GB | Two conditioning paths
+ Refiner (10 steps) | 2.7 s | 11.0 GB | Adds 0.7 s, 1.4 GB
Full stack: base + refiner + 4 LoRAs + 1 ControlNet | 3.6 s | 14.5 GB | Studio default

LoRAs are nearly free per-image because PEFT merges the low-rank deltas into the UNet weights at load time. The cost is one-off load latency (~150 ms per LoRA) and a small VRAM bump. ControlNet, by contrast, runs a parallel 1.3B network at every step — the +30% latency is unavoidable.
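
A minimal sketch of the fuse-at-load pattern via Diffusers' PEFT integration; the LoRA repo ID, adapter name and scale are placeholders:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Attach the adapter, then merge its low-rank deltas into the UNet
# weights so the per-step cost stays at baseline.
pipe.load_lora_weights("your-org/your-sdxl-lora", adapter_name="style")
pipe.set_adapters(["style"], adapter_weights=[0.8])
pipe.fuse_lora()

images = pipe("a watercolour street scene", num_inference_steps=30).images

pipe.unfuse_lora()  # restore base weights before loading a different stack
```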

Cross-card comparison

SDXL 1024×1024 at 30 steps DPM++ 2M Karras, batch 1:

GPU | VRAM | Bandwidth | SDXL b=1 | SDXL b=4 | Distilled 4-step
RTX 5060 Ti 16GB | 16 GB | 448 GB/s | 4.8 s | 16.5 s | 1.6 s
RTX 5080 16GB | 16 GB | 960 GB/s | 2.4 s | 8.1 s | 0.85 s
RTX 3090 24GB | 24 GB | 936 GB/s | 2.7 s | 9.2 s | 0.95 s
RTX 4090 24GB | 24 GB | 1008 GB/s | 2.0 s | 6.5 s | 0.7 s
RTX 5090 32GB | 32 GB | 1792 GB/s | 1.2 s | 3.8 s | 0.42 s
H100 80GB | 80 GB | 3350 GB/s | 1.0 s | 3.1 s | 0.36 s

The 4090 sits in the price/perf sweet spot. The 5090 is faster but pricier per hour; the H100’s bandwidth advantage is wasted on a compute-bound workload. The 5060 Ti works for hobby use but multi-batch SDXL serving on it stops being economic above a few users.

Production gotchas

  • VAE FP16 colour bug. The default SDXL VAE produces saturated artefacts in FP16 on certain images. Either use the madebyollin/sdxl-vae-fp16-fix checkpoint or run the VAE in FP32, which costs ~150 ms per decode (see the sketch after this list).
  • torch.compile cold start. First call after compile takes 90-120 seconds. Pre-warm before opening to traffic, and persist the compiled cache between restarts via torch._inductor.config.fx_graph_cache = True (also sketched below).
  • OpenCLIP-G memory spikes. The text encoder briefly allocates 3.4 GB during its forward pass. If you’re already at 22 GB UNet+activation, you’ll OOM on text encode. Encode prompts first, free encoders, then run UNet.
  • ControlNet preprocessor on CPU is the bottleneck. Canny, depth and openpose preprocessing in PIL/OpenCV on CPU often takes 200-400 ms — longer than the actual generation. Move preprocessors to GPU (controlnet-aux supports this) or run them async.
  • LoRA scale stacking is not commutative. Loading LoRA A at scale 0.8 then B at 0.6 produces different output than loading B then A. Always load in deterministic order and document the stack.
  • Refiner needs the same noise schedule as base. Switching samplers between base and refiner produces seam artefacts at the handoff point. Stick to DPM++ 2M Karras for both.
  • Don’t stream Diffusers callbacks over network. The intermediate latent decode adds 30-50 ms per callback; stream only every 5th step or you waste 20% of your latency budget on previews.
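
Minimal sketches for the first two fixes, using the checkpoint IDs named above; the compile mode is an illustrative choice:

```python
import torch
from diffusers import AutoencoderKL, StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# 1. VAE FP16 colour bug: swap in the fixed VAE rather than paying
#    ~150 ms per decode for FP32.
pipe.vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
).to("cuda")

# 2. torch.compile cold start: persist inductor's graph cache across
#    restarts, then pre-warm before opening to traffic.
torch._inductor.config.fx_graph_cache = True
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead")
pipe("warm-up prompt", num_inference_steps=2)  # trigger compilation now
```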

Verdict: when to pick SDXL on a 4090

SDXL on the 4090 is the workhorse choice for any production image pipeline that needs 1024-pixel native output, ControlNet support, and a mature LoRA ecosystem. At 1.63 seconds per image batched, you serve 2,200 images per hour from one card with comfortable headroom for stacking. Pick it over FLUX.1-dev when speed matters more than fidelity, over FLUX schnell when you need ControlNet (which the FLUX ecosystem is still catching up on), and over SD 1.5 whenever you’re outputting above 768 pixels. Move to the 5090 only when you’ve maxed out a 4090 and need batch 8 at 1024 native — and check best GPU for Stable Diffusion for the broader landscape first.

See also: Stable Diffusion setup, ComfyUI setup, FLUX schnell benchmark, FLUX dev benchmark, Stable Video Diffusion, image studio use case, 5060 Ti SDXL comparison, best GPU for SD.
