The RTX 4090 24GB is the sweet spot for serious diffusion work outside data-centre cards. It runs SDXL at 1.6 s per 1024-pixel image when batched, fits FLUX.1-dev FP8 at 4.1 s, and serves SD 1.5 with ControlNet and a stack of LoRAs, all from one UK GPU host. This post covers the named workload (a 12-engineer creative team running an internal image studio plus a public-facing API), the model lineup, capacity numbers and the production gotchas we have seen across two years of deployments.
Contents
- Named workload: 12-engineer studio
- Model lineup and trade-offs
- Throughput benchmarks
- FLUX.1-dev in 24 GB
- ControlNet, LoRA and IP-Adapter stacks
- Studio capacity and scaling triggers
- Multi-pipeline serving config
- Production gotchas
- Verdict: when to pick a 4090 for a studio
Named workload: 12-engineer studio
Reference workload: a UK product design agency with 12 creative engineers running an internal image studio (peak 4-6 concurrent designers iterating in ComfyUI), plus a public-facing customer endpoint exposing SDXL and FLUX schnell behind their own brand. Average daily image volume over the last quarter: 14,800 internal renders and 38,400 customer-facing API renders. Peak observed hourly burst: 2,400 images.
SLA targets: internal designers want sub-3-second SDXL renders and sub-6-second FLUX renders for interactive iteration; the customer API has a soft 10-second wall-clock budget. Both must run with ControlNet stacks for the brand-consistency LoRA bundle (3 LoRAs always loaded, 1-2 ControlNets common).
This workload runs on three 4090s in production: one dedicated to interactive ComfyUI sessions, two pooled behind a least-loaded balancer for the API. Total cost is roughly 18% of the equivalent SaaS image API spend at their volume.
Model lineup and trade-offs
| Model | Precision | VRAM | Best for | Latency 1024 b=1 |
|---|---|---|---|---|
| SDXL base + refiner | FP16 | ~10 GB | 1024-1536 photoreal, ControlNet | 2.7 s |
| SDXL Lightning 4-step | FP16 | ~9 GB | Interactive previews | 0.7 s |
| SDXL Turbo 4-step | FP16 | ~9 GB | Realtime sliders | 0.55 s |
| SD 1.5 | FP16 | ~4 GB | LoRA-heavy stylised, batch huge | 0.45 s |
| FLUX.1-dev FP8 | FP8 | ~14 GB | Best-in-class fidelity | 4.1 s |
| FLUX.1-dev FP16 | FP16 | ~22 GB | Text-in-image, max quality | 6.2 s |
| FLUX.1-schnell FP8 | FP8 | ~14 GB | 4-step FLUX quality | 1.8 s |
The decision tree we recommend: SDXL for default 1024-pixel work because of its mature ControlNet and LoRA ecosystem; FLUX.1-dev when you need state-of-the-art fidelity and ControlNet isn't required; FLUX schnell for high-volume FLUX-quality serving; SD 1.5 only when you have brand-style LoRAs that don't transfer up to SDXL.
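For request routing, that tree reduces to a few lines. A toy encoding follows; this is our illustration, not a library API, and the model keys are arbitrary labels:

def pick_model(needs_controlnet: bool, max_fidelity: bool,
               high_volume: bool, sd15_brand_loras: bool) -> str:
    # Toy encoding of the recommendation above; keys are arbitrary labels.
    if sd15_brand_loras:
        return "sd-1.5"              # brand LoRAs that don't transfer to SDXL
    if needs_controlnet:
        return "sdxl"                # mature ControlNet/LoRA ecosystem
    if max_fidelity:
        return "flux.1-dev-fp8"      # state-of-the-art fidelity, no ControlNet
    if high_volume:
        return "flux.1-schnell-fp8"  # 4-step FLUX quality at volume
    return "sdxl"                    # safe default for 1024px work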
Throughput benchmarks
Per-image wall-clock times with default schedulers, measured on a stock 4090 with Diffusers 0.30, PyTorch 2.5 and SDPA attention:
| Pipeline | Resolution | Steps | Latency b=1 | Latency b=4 | img/min batched |
|---|---|---|---|---|---|
| SDXL base | 1024×1024 | 30 (DPM++ 2M) | 2.0 s | 6.5 s | 37 |
| SDXL base + refiner | 1024×1024 | 30+10 | 2.7 s | 8.1 s | 30 |
| SDXL Lightning | 1024×1024 | 4 | 0.7 s | 1.6 s | 150 |
| SDXL Turbo | 1024×1024 | 4 | 0.55 s | 1.4 s | 171 |
| SD 1.5 | 768×768 | 30 | 0.95 s | 2.6 s | 92 |
| FLUX.1-dev FP8 | 1024×1024 | 30 (Euler) | 4.1 s | 12.4 s | 19 |
| FLUX.1-schnell FP8 | 1024×1024 | 4 | 1.8 s | 5.6 s | 43 |
SDXL base at 1.6 s/image batched gives 2,200 images/hour per card. FLUX.1-dev FP8 at 3.1 s/image batched gives 1,160/hour with state-of-the-art fidelity. FLUX.1-schnell FP8 at 1.4 s/image is the practical default for high-volume FLUX-quality serving. SD 1.5 at 0.65 s/image batched can hit 5,500 images/hour for stylised low-resolution work.
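To reproduce the SDXL rows, a minimal timing harness along these lines works. This is a sketch, not the exact script behind the table:

import time
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

def bench(prompt: str, batch: int = 4, steps: int = 30, runs: int = 5) -> float:
    # warm-up run to compile kernels and fill caches
    pipe(prompt=prompt, num_inference_steps=steps, num_images_per_prompt=batch)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(runs):
        pipe(prompt=prompt, num_inference_steps=steps, num_images_per_prompt=batch)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / (runs * batch)  # seconds per image

print(f"{bench('a product photo, studio lighting'):.2f} s/image")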
FLUX.1-dev in 24 GB
FLUX.1-dev’s 12B-parameter MMDiT transformer at FP16 plus the T5-XXL text encoder needs careful VRAM accounting on a 24 GB card. The trick is sequential CPU offload of T5: encode the prompt, free T5 to system RAM, then load the transformer for denoising. Peak GPU VRAM lands at ~22 GB, leaving a safe 2 GB margin.
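A minimal sketch of that offload pattern with stock Diffusers follows. Module names are as in FluxPipeline; some Diffusers versions resolve the pipeline device from the first attached module, hence detaching T5 rather than just moving it:

import torch
from diffusers import FluxPipeline

# Load on CPU so nothing touches VRAM until we choose to move it
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
prompt = "product render, studio lighting"

# 1. Encode with only the text encoders resident on the GPU
pipe.text_encoder.to("cuda")    # CLIP, small
pipe.text_encoder_2.to("cuda")  # T5-XXL, ~9 GB in BF16
prompt_embeds, pooled_prompt_embeds, _ = pipe.encode_prompt(
    prompt=prompt, prompt_2=prompt, device="cuda"
)

# 2. Free T5 back to system RAM before the transformer comes in
t5 = pipe.text_encoder_2.to("cpu")  # keep it around for the next prompt
pipe.text_encoder_2 = None          # detach so device resolution ignores it
torch.cuda.empty_cache()

# 3. Denoise and decode with the 12B transformer resident
pipe.transformer.to("cuda")
pipe.vae.to("cuda")
image = pipe(
    prompt_embeds=prompt_embeds, pooled_prompt_embeds=pooled_prompt_embeds
).images[0]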
FP8 quantisation via torchao changes the calculus completely. The transformer drops to 11.8 GB resident with zero offload, leaving ~12 GB free for batching, ControlNet, or co-located smaller models. FP8 also runs 1.5x faster than BF16 on the 4090's fourth-generation tensor cores, which support FP8 natively. Quality measurements across 300 prompts show CLIP-T down 0.2 and LPIPS of 0.061 vs FP16, within seed-to-seed variance for everything except very small text inside images. The full FLUX setup guide documents the recipe.
ControlNet, LoRA and IP-Adapter stacks
Real studio work always involves stacking. The 4090 has enough VRAM to handle aggressive stacks:
| Stack | SDXL latency b=1 | VRAM | Notes |
|---|---|---|---|
| SDXL base | 2.0 s | 9.0 GB | Baseline |
| + 4 LoRAs (PEFT merge) | 2.05 s | 10.0 GB | One-off load cost only |
| + ControlNet (Canny) | 2.6 s | 11.5 GB | +30% per step |
| + ControlNet (Canny + Depth) | 3.2 s | 13.5 GB | Two parallel paths |
| + IP-Adapter Plus | 2.4 s | 11.2 GB | Image conditioning |
| + Refiner | 2.7 s | 11.0 GB | 10-step refiner |
| Studio default: base + refiner + 4 LoRAs + 1 ControlNet | 3.6 s | 14.5 GB | Brand-consistent renders |
| Heavy stack: base + refiner + 4 LoRAs + 2 ControlNets + IP-Adapter | 4.4 s | 17.2 GB | Complex compositions |
LoRAs are nearly free per-image because PEFT merges the low-rank deltas into the UNet weights at load time. ControlNet, by contrast, runs a parallel 1.3B network at every step, so the +30% per ControlNet is unavoidable. The 4090's 24GB headroom lets you stack all of this without OOM; on a 16GB card you would have to choose between batching and stacking.
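Loading the studio default stack in Diffusers looks roughly like this; the LoRA repo names are placeholders for the brand bundle, and canny_map is a preprocessed conditioning image you supply:

import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Hypothetical brand LoRAs: substitute your own repos. Load order matters
# when scales stack (see gotchas), so keep it canonical.
pipe.load_lora_weights("your-org/brand-style-lora", adapter_name="style")
pipe.load_lora_weights("your-org/brand-palette-lora", adapter_name="palette")
pipe.set_adapters(["style", "palette"], adapter_weights=[0.8, 0.6])
pipe.fuse_lora()  # merge the deltas into the UNet: near-zero per-image cost

image = pipe(prompt="hero shot", image=canny_map).images[0]  # canny_map: PIL image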
Studio capacity and scaling triggers
| Workload | Per-4090 capacity | Notes |
|---|---|---|
| SDXL 1024 photoreal (default stack) | 1,800 images/hour | Batch 4 with ControlNet + LoRAs |
| FLUX.1-dev FP8 premium | 1,160 images/hour | Batch 4, no ControlNet |
| FLUX.1-schnell FP8 high-volume | 2,560 images/hour | Batch 4, 4-step |
| SD 1.5 brand-style with 5 LoRAs | 5,500 images/hour | Batch 8 at 768px |
| SDXL Turbo realtime | 10,200 images/hour | Batch 4, 4-step |
| Concurrent designers (interactive) | 4-6 active sessions | Mixed pipeline |
Scaling triggers for the named 12-engineer studio (a toy encoding of the thresholds follows the list):
- Add a card when sustained volume approaches ~30,000 SDXL renders/day per card. Beyond that, queue latency starts breaking the SLA.
- Split FLUX traffic to a dedicated card at 600+ FLUX renders/day. Mixed FLUX/SDXL on one card means cold-swap penalties (8-12 seconds) hurt interactive UX.
- Promote to 5090 32GB when stacking heavy ControlNet on FLUX. The extra 8GB matters for FLUX + 2x ControlNet, which is otherwise infeasible.
- Add a small CPU-only Postgres/Redis box for prompt history and LoRA metadata. Don’t co-locate on the GPU host; it competes for PCIe and disk I/O.
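A small helper encoding those thresholds, useful as a capacity-dashboard check. This is our sketch; the FLUX figure in the example is hypothetical:

def scaling_actions(daily_renders: int, daily_flux: int, cards: int) -> list[str]:
    # Thresholds taken from the scaling triggers above
    actions = []
    if daily_renders / cards > 30_000:
        actions.append("add a 4090 to the pool")
    if daily_flux > 600:
        actions.append("move FLUX to a dedicated card")
    return actions

# The named studio: ~53,200 renders/day on three cards, FLUX share hypothetical
print(scaling_actions(53_200, 800, 3))  # ['move FLUX to a dedicated card']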
Multi-pipeline serving config
The pattern that works for studios mixing models: two pre-loaded pipelines per card, routed by a shared scheduler. Below is the FLUX FP8 load with torchao quantisation that keeps ~14GB resident and leaves room for an SDXL pipeline alongside:
from diffusers import FluxPipeline, StableDiffusionXLPipeline
import torch
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight
# FLUX FP8: 14GB resident
flux = FluxPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
quantize_(flux.transformer, float8_dynamic_activation_float8_weight())
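# keep the VAE out of FP8: an FP8 VAE causes colour drift (see gotchas below)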
flux.vae.to(torch.float16)
# SDXL: ~9GB resident, co-loaded
sdxl = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
# Total ~23GB; route by request type
def render(model_name, prompt, **kw):
    pipe = flux if model_name == "flux" else sdxl
    return pipe(prompt=prompt, **kw).images[0]
This consumes ~23GB resident, tight but workable for a 4090 dedicated to mixed traffic. For the API tier, separate cards by model class and load-balance externally. Full setup recipes are in FLUX setup and Stable Diffusion setup.
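For the two-card API pool, least-loaded routing can be as simple as an in-process counter per worker. A sketch assuming one HTTP worker per GPU; the worker URLs are placeholders:

import threading

class LeastLoadedRouter:
    def __init__(self, workers: list[str]):
        self.load = {w: 0 for w in workers}
        self.lock = threading.Lock()

    def acquire(self) -> str:
        # pick the worker with the fewest in-flight renders
        with self.lock:
            worker = min(self.load, key=self.load.get)
            self.load[worker] += 1
            return worker

    def release(self, worker: str) -> None:
        with self.lock:
            self.load[worker] -= 1

router = LeastLoadedRouter(["http://gpu-a:8188", "http://gpu-b:8188"])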
Production gotchas
- VAE FP16 colour bug on SDXL. The default VAE produces saturated artefacts in FP16. Use the madebyollin/sdxl-vae-fp16-fix checkpoint or run the VAE in FP32 (+150 ms).
- FLUX VAE in FP8 produces colour drift. Always keep the VAE in FP16 even when the transformer is FP8.
- ComfyUI workflows leak VRAM. Custom-node workflows can hold references to old pipelines; restart the worker every 200-500 images depending on workflow complexity.
- LoRA scale stacking is order-dependent. Loading LoRA A at 0.8 then B at 0.6 produces different output than B then A. Document the canonical order.
- ControlNet preprocessor on CPU is the bottleneck. Canny and depth preprocessing in PIL/OpenCV often takes 200-400ms. Move it to GPU via controlnet-aux or run it async ahead of generation (a sketch follows this list).
- Don’t co-locate ComfyUI and an LLM on the same 4090. Designer workflows have unpredictable VRAM spikes that will OOM the LLM.
- NSFW filters add 80-150ms per image. If you run safety classification, batch-classify post-VAE rather than per-step.
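The async preprocessing variant from the ControlNet gotcha, sketched with controlnet-aux and a thread pool; variable names are ours, and pipe is the ControlNet pipeline from earlier:

from concurrent.futures import ThreadPoolExecutor
from controlnet_aux import CannyDetector

canny = CannyDetector()
pool = ThreadPoolExecutor(max_workers=2)

def queue_job(source_image, prompt):
    # start edge extraction while the GPU is still busy with the previous render
    control_future = pool.submit(canny, source_image)
    return prompt, control_future

prompt, control_future = queue_job(source_image, "hero shot")  # source_image: PIL.Image
control = control_future.result()  # usually ready before the GPU frees up
image = pipe(prompt=prompt, image=control).images[0]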
Verdict: when to pick a 4090 for a studio
Pick the RTX 4090 24GB for a creative image studio when you need the full diffusion stack (SDXL with ControlNet and LoRA stacks, FLUX FP8, and SD 1.5) on one card with comfortable headroom. The named 12-engineer workload runs three 4090s for daily volumes the team would otherwise pay £8,000+/month to a SaaS provider for. Step down to the 5060 Ti 16GB only for solo creators or low-volume API endpoints; you lose the FLUX.1-dev fit and meaningful batching. Step up to the 5090 32GB when FLUX + ControlNet is your main workload or when you want batch 8 at native 1024 SDXL. See best GPU for Stable Diffusion for the broader landscape.
Run SDXL and FLUX on one card
2,200 SDXL images per hour, FLUX.1-dev FP8 fits comfortably, ControlNet and LoRAs stack. UK dedicated hosting, predictable monthly cost.
Order the RTX 4090 24GB
See also: FLUX setup, Stable Diffusion setup, ComfyUI setup, SDXL benchmark, FLUX dev benchmark, FLUX schnell benchmark, Stable Video Diffusion, best GPU for SD, 4090 spec breakdown.