SDXL 1.0 remains the most widely deployed high-resolution diffusion model in production. The base UNet is 2.6 billion parameters in FP16, small by 2026 standards, but it pairs with two text encoders and produces native 1024-pixel output, a pipeline heavy enough that latency on lesser cards is painful. On an RTX 4090 24GB dedicated host from Gigagpu, a single 1024×1024 image at 30 steps with the DPM++ 2M Karras sampler renders in 2.0 seconds. A batch of four completes in 6.5 seconds, or 1.63 seconds per image: the natural sweet spot for studio-grade serving, with comfortable VRAM headroom left for ControlNet and a stack of LoRAs.
Contents
- Methodology and test rig
- Single-image latency by resolution
- Batched throughput and VRAM
- VRAM map and the bandwidth ceiling
- Distilled variants: Lightning, Turbo, LCM
- Refiner, LoRA and ControlNet stacks
- Cross-card comparison
- Production gotchas
- Verdict: when to pick SDXL
Methodology and test rig
All measurements use Diffusers 0.30 with PyTorch 2.5, sdpa attention via F.scaled_dot_product_attention, on a stock RTX 4090 24GB Founders Edition at 450W TDP. Host is a Ryzen 9 7950X with 64GB DDR5-5600 and a Samsung 990 Pro 2TB Gen 4 NVMe; OS is Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6. Each result is the median of fifty runs after a five-image warm-up; standard deviation stayed under 3% in every configuration.
Wall-clock latency includes text encoding and VAE decode but excludes any network or PIL save time. Batch numbers reflect a single forward pass producing N images simultaneously, not N sequential calls. Where we use torch.compile, the 8-12% gain it produces is reported in the configuration section but excluded from the headline numbers, since first-call compilation latency makes it inappropriate for cold endpoints.
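The timing protocol above can be sketched as a small harness. This is a hypothetical helper, not the article's actual rig code; `sync` should be `torch.cuda.synchronize` on a GPU host so the queue is drained before and after each timed call.

```python
import statistics
import time

def bench(pipe_call, warmup=5, runs=50, sync=None):
    """Median wall-clock latency over `runs` calls after `warmup`
    throwaway calls. `sync` is an optional callable (e.g.
    torch.cuda.synchronize) run before and after each timed call so
    async GPU work is fully accounted for."""
    for _ in range(warmup):
        pipe_call()
    times = []
    for _ in range(runs):
        if sync:
            sync()
        start = time.perf_counter()
        pipe_call()
        if sync:
            sync()
        times.append(time.perf_counter() - start)
    return statistics.median(times)
```

Usage would look like `bench(lambda: pipe(prompt, num_inference_steps=30), sync=torch.cuda.synchronize)`.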
Single-image latency by resolution
SDXL’s quadratic attention scaling means resolution dominates latency once you cross the 1024 threshold. The numbers below cover the practical native-trained resolutions plus the common 1.5x landscape variants:
| Resolution | Steps | Sampler | Latency | Steps/s | VRAM |
|---|---|---|---|---|---|
| 1024 × 1024 | 20 | DPM++ 2M Karras | 1.4 s | 14.3 | 9.0 GB |
| 1024 × 1024 | 30 | DPM++ 2M Karras | 2.0 s | 15.0 | 9.0 GB |
| 1024 × 1024 | 50 | Euler a | 3.3 s | 15.2 | 9.0 GB |
| 1216 × 832 | 30 | DPM++ 2M Karras | 2.1 s | 14.3 | 9.4 GB |
| 1344 × 768 | 30 | DPM++ 2M Karras | 2.0 s | 15.0 | 9.3 GB |
| 1536 × 1024 | 30 | DPM++ 2M Karras | 3.1 s | 9.7 | 11.2 GB |
| 2048 × 2048 (HiRes fix) | 30+15 | DPM++ 2M / Latent | 9.6 s | — | 17.8 GB |
Note the cliff at 1536×1024: per-step time jumps from 67 ms to 103 ms because the UNet's attention now operates on 1.5x as many latent tokens, and self-attention cost grows with the square of the token count, roughly 2.25x. Above 1536 you should be running HiRes-fix or img2img upscaling rather than native generation.
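The token arithmetic behind that cliff is easy to check. A sketch, assuming SDXL's VAE downsamples by 8x per side and one attention token per latent pixel:

```python
def latent_tokens(width, height, vae_factor=8):
    """Latent grid size the UNet's attention layers see after the
    VAE's 8x spatial downsampling."""
    return (width // vae_factor) * (height // vae_factor)

base = latent_tokens(1024, 1024)   # 16384 tokens
wide = latent_tokens(1536, 1024)   # 24576 tokens, 1.5x more
attn_ratio = (wide / base) ** 2    # self-attention cost scales ~quadratically: 2.25x
```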
Batched throughput and VRAM
Batching is the single highest-leverage optimisation for SDXL serving. The UNet weights stay resident; only activations and KV grow with batch. The 4090 has enough compute to push aggregate steps/s up to a practical ceiling around batch 4-6:
| Batch | Latency (30 steps) | s/image | Aggregate img/min | VRAM peak |
|---|---|---|---|---|
| 1 | 2.0 s | 2.00 | 30 | 9.0 GB |
| 2 | 3.4 s | 1.70 | 35 | 11.5 GB |
| 4 | 6.5 s | 1.63 | 37 | 16.0 GB |
| 6 | 9.4 s | 1.57 | 38 | 22.5 GB (tight) |
| 8 @ 1024 | OOM | — | — | >24 GB |
| 8 @ 896 | 10.8 s | 1.35 | 44 | 20.5 GB |
Batch 4 at 1024 is the production default. Batch 6 squeezes a fraction more throughput but leaves no headroom for ControlNet or text-encoder peaks. If you really need batch 8, drop to 896×896 (still above the 768 native trained resolution) or enable VAE tiling, which adds about 80 ms of latency but caps VAE peak around 2 GB.
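A minimal batched-serving sketch under those assumptions. The model id is the standard Hugging Face repo; the heavy imports are deferred inside the builder so the helpers stay importable on a machine without a GPU stack:

```python
def make_pipe(device="cuda"):
    """Build the SDXL pipeline once; weights stay resident across calls."""
    import torch
    from diffusers import StableDiffusionXLPipeline
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16, variant="fp16",
    ).to(device)
    pipe.enable_vae_tiling()  # ~80 ms extra, caps the VAE decode peak near 2 GB
    return pipe

def generate_batch(pipe, prompt, batch=4, steps=30, side=1024):
    """One forward pass producing `batch` images simultaneously,
    not `batch` sequential calls."""
    return pipe(prompt=prompt, num_inference_steps=steps,
                height=side, width=side,
                num_images_per_prompt=batch).images

def images_per_minute(batch, latency_s):
    """Aggregate throughput implied by a single batch latency."""
    return 60.0 * batch / latency_s
```

At the measured latencies, `images_per_minute(4, 6.5)` reproduces the table's 37 img/min.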
VRAM map and the bandwidth ceiling
The 4090’s 24GB GDDR6X delivers 1008 GB/s of memory bandwidth and 16,384 CUDA cores at 2.5 GHz, which is enough that SDXL’s UNet is compute-bound, not memory-bound, at native resolutions. That changes the optimisation calculus relative to LLM workloads:
| Component | FP16 bytes | Notes |
|---|---|---|
| UNet weights | 5.0 GB | 2.6B params, FP16 |
| VAE weights | 0.4 GB | FP16; FP32 needed for fp16 colour bug workaround |
| CLIP-L + OpenCLIP-G | 1.7 GB | Resident; can offload between calls |
| Activations (1024 px, b=1) | 2.0 GB | Quadratic in resolution, linear in batch |
| CUDA scratch + workspace | 0.4 GB | FlashAttention buffers |
| VAE decode peak | 1.8 GB | Transient; VAE tiling caps at ~2 GB |
The math: each UNet step at 1024 touches roughly 5 GB of weights. At 1008 GB/s that is 5 ms of memory work, but the actual step takes 67 ms — the 4090 is doing real arithmetic, not just shuffling bytes. This is why FP8 quantisation of SDXL gives marginal wins (~10%) compared to the 40-50% wins it produces on bandwidth-bound LLM decode.
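The back-of-envelope above in code form, using the article's own figures:

```python
def memory_floor_ms(weights_gb, bandwidth_gb_s):
    """Lower bound on step time if the GPU only had to stream the
    UNet weights once per step, with zero arithmetic."""
    return 1000.0 * weights_gb / bandwidth_gb_s

floor_ms = memory_floor_ms(5.0, 1008)         # ~5 ms of pure memory traffic
measured_ms = 67.0                            # observed per-step time at 1024px
compute_share = 1.0 - floor_ms / measured_ms  # ~0.93: the step is compute-bound
```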
Distilled variants: Lightning, Turbo, LCM
For latency-critical paths, distilled SDXL variants reduce step counts dramatically. Quality drops are real but often acceptable for previews, thumbnails, or interactive sliders:
| Variant | Steps | Latency b=1 | Latency b=4 | VRAM | Use case |
|---|---|---|---|---|---|
| SDXL base | 30 | 2.0 s | 6.5 s | 9.0 GB | Final renders |
| SDXL Lightning 4-step | 4 | 0.7 s | 1.6 s | 9.0 GB | Previews, drafts |
| SDXL Turbo 4-step | 4 | 0.55 s | 1.4 s | 9.0 GB | Realtime sliders |
| SDXL LCM 8-step | 8 | 1.0 s | 2.5 s | 9.2 GB | Mid-fidelity |
| SDXL Hyper 1-step | 1 | 0.18 s | 0.42 s | 9.0 GB | Thumbnail grids |
SDXL Turbo at 4 steps and 0.55 seconds gives you 109 images per minute single-stream, fast enough for interactive design tools. The full SDXL setup guide documents how to swap between base and distilled checkpoints behind a single endpoint for quality/speed tiering.
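One way to implement that quality/speed tiering is a small registry mapping tiers to checkpoints and sampling settings. Repo ids and values here are illustrative, not a prescribed layout; distilled checkpoints are trained to run with classifier-free guidance disabled:

```python
# tier -> (checkpoint repo, steps, guidance scale); illustrative values
TIERS = {
    "final":    ("stabilityai/stable-diffusion-xl-base-1.0", 30, 7.0),
    "preview":  ("ByteDance/SDXL-Lightning", 4, 0.0),
    "realtime": ("stabilityai/sdxl-turbo", 4, 0.0),
}

def tier_settings(tier):
    """Checkpoint repo plus pipeline call kwargs for one serving tier."""
    repo, steps, cfg = TIERS[tier]
    return repo, {"num_inference_steps": steps, "guidance_scale": cfg}
```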
Refiner, LoRA and ControlNet stacks
Real production pipelines rarely run base-only. Adding refiner, LoRAs and ControlNet to the same 4090:
| Stack | Latency b=1 | VRAM | Notes |
|---|---|---|---|
| SDXL base | 2.0 s | 9.0 GB | Baseline |
| + 4 LoRAs | 2.05 s | 10.0 GB | PEFT adapter merging, no per-step cost |
| + ControlNet (Canny) | 2.6 s | 11.5 GB | +30% per step, +1.5 GB |
| + ControlNet + IP-Adapter | 2.9 s | 13.0 GB | Two conditioning paths |
| + Refiner (10 steps) | 2.7 s | 11.0 GB | Adds 0.7s, 1.4 GB |
| Full stack: base + refiner + 4 LoRAs + 1 ControlNet | 3.6 s | 14.5 GB | Studio default |
LoRAs are nearly free per-image because PEFT merges the low-rank deltas into the UNet weights at load time. The cost is one-off load latency (~150 ms per LoRA) and a small VRAM bump. ControlNet, by contrast, runs a parallel 1.3B network at every step — the +30% latency is unavoidable.
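With Diffusers' PEFT integration, that merge-at-load pattern looks roughly like this. A sketch, assuming `stack` is a list of `(repo, scale)` pairs in your documented load order:

```python
def load_lora_stack(pipe, stack):
    """Load LoRAs in a fixed, documented order (ordering matters, see
    gotchas), set per-adapter scales, then fuse the low-rank deltas
    into the UNet weights so there is no per-step cost."""
    names, scales = [], []
    for i, (repo, scale) in enumerate(stack):
        name = f"lora_{i}"
        pipe.load_lora_weights(repo, adapter_name=name)
        names.append(name)
        scales.append(scale)
    pipe.set_adapters(names, adapter_weights=scales)
    pipe.fuse_lora()
    return pipe
```

`fuse_lora()` folds the scaled deltas into the base weights; call `unfuse_lora()` before swapping in a different stack.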
Cross-card comparison
SDXL 1024×1024 at 30 steps DPM++ 2M Karras, batch 1:
| GPU | VRAM | Bandwidth | SDXL b=1 | SDXL b=4 | Distilled 4-step |
|---|---|---|---|---|---|
| RTX 5060 Ti 16GB | 16 | 448 GB/s | 4.8 s | 16.5 s | 1.6 s |
| RTX 5080 16GB | 16 | 960 GB/s | 2.4 s | 8.1 s | 0.85 s |
| RTX 3090 24GB | 24 | 936 GB/s | 2.7 s | 9.2 s | 0.95 s |
| RTX 4090 24GB | 24 | 1008 GB/s | 2.0 s | 6.5 s | 0.7 s |
| RTX 5090 32GB | 32 | 1792 GB/s | 1.2 s | 3.8 s | 0.42 s |
| H100 80GB | 80 | 3350 GB/s | 1.0 s | 3.1 s | 0.36 s |
The 4090 sits in the price/perf sweet spot. The 5090 is faster but pricier per hour; the H100’s bandwidth advantage is wasted on a compute-bound workload. The 5060 Ti works for hobby use but multi-batch SDXL serving on it stops being economic above a few users.
Production gotchas
- VAE FP16 colour bug. The default SDXL VAE produces saturated artefacts in FP16 on certain images. Either use the `madebyollin/sdxl-vae-fp16-fix` checkpoint or run the VAE in FP32, which costs ~150 ms per decode.
- torch.compile cold start. First call after compile takes 90-120 seconds. Pre-warm before opening to traffic; persist the compiled cache between restarts via `torch._inductor.config.fx_graph_cache = True`.
- OpenCLIP-G memory spikes. The text encoder briefly allocates 3.4 GB during its forward pass. If you're already at 22 GB UNet+activation, you'll OOM on text encode. Encode prompts first, free the encoders, then run the UNet.
- ControlNet preprocessor on CPU is the bottleneck. Canny, depth and openpose preprocessing in PIL/OpenCV on CPU often takes 200-400 ms — longer than the actual generation. Move preprocessors to GPU (controlnet-aux supports this) or run them async.
- LoRA scale stacking is not commutative. Loading LoRA A at scale 0.8 then B at 0.6 produces different output than loading B then A. Always load in deterministic order and document the stack.
- Refiner needs the same noise schedule as base. Switching samplers between base and refiner produces seam artefacts at the handoff point. Stick to DPM++ 2M Karras for both.
- Don’t stream Diffusers callbacks over network. The intermediate latent decode adds 30-50 ms per callback; stream only every 5th step or you waste 20% of your latency budget on previews.
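The encode-first pattern from the OpenCLIP-G gotcha can be sketched as below. A hypothetical helper: on a real host you would pass `torch.cuda.empty_cache` as `empty_cache` so freed blocks go back to the allocator.

```python
import gc

def encode_then_free(pipe, prompts, device="cuda", empty_cache=None):
    """Run both text encoders up front, keep only the embeddings, and
    move the encoders to CPU before the UNet loop, so the transient
    ~3.4 GB OpenCLIP-G forward peak never overlaps UNet activations."""
    embeds = [pipe.encode_prompt(p, device=device, num_images_per_prompt=1)
              for p in prompts]
    pipe.text_encoder.to("cpu")
    pipe.text_encoder_2.to("cpu")
    gc.collect()
    if empty_cache:
        empty_cache()
    return embeds
```

Each entry of `embeds` holds the prompt and pooled embeddings, which go back into the pipeline via `prompt_embeds=...` and `pooled_prompt_embeds=...` instead of a raw prompt string.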
Verdict: when to pick SDXL on a 4090
SDXL on the 4090 is the workhorse choice for any production image pipeline that needs 1024-pixel native output, ControlNet support, and a mature LoRA ecosystem. At 1.63 seconds per image batched, you serve 2,200 images per hour from one card with comfortable headroom for stacking. Pick it over FLUX.1-dev when speed matters more than fidelity, over FLUX schnell when you need ControlNet (which the FLUX ecosystem is still catching up on), and over SD 1.5 whenever you’re outputting above 768 pixels. Move to the 5090 only when you’ve maxed out a 4090 and need batch 8 at 1024 native — and check best GPU for Stable Diffusion for the broader landscape first.
SDXL at 1.6 seconds per image
Batched 1024-pixel generation on UK 4090 hosts. 2,200 images per hour, ControlNet and LoRA stacks fit comfortably in 24GB.
Order the RTX 4090 24GB

See also: Stable Diffusion setup, ComfyUI setup, FLUX schnell benchmark, FLUX dev benchmark, Stable Video Diffusion, image studio use case, 5060 Ti SDXL comparison, best GPU for SD.