
RTX 4090 24GB for Creative Image Generation Studio

A creative studio backend on the RTX 4090 24GB: SDXL, FLUX.1-dev FP8, SD 1.5 + LoRAs and ControlNet pipelines. 12-engineer team workload, capacity tables, scaling triggers and production gotchas.

The RTX 4090 24GB is the sweet spot for serious diffusion work outside data-centre cards. It runs SDXL at 1.6 seconds per 1024-pixel image when batched, fits FLUX.1-dev FP8 at 4.1 seconds, and serves SD 1.5 with ControlNet and a stack of LoRAs from one UK GPU host. This post covers the named workload (a 12-engineer creative team running an internal image studio plus a public-facing API), the model lineup, capacity numbers, and the production gotchas we have seen across two years of deployments.

Named workload: 12-engineer studio

Reference workload: a UK product design agency with 12 creative engineers running an internal image studio (peak 4-6 concurrent designers iterating in ComfyUI), plus a public-facing customer endpoint exposing SDXL and FLUX.1-schnell behind their own brand. Daily image volume in the last quarter: 14,800 internal renders and 38,400 customer-facing API renders. Peak hourly burst observed: 2,400 images.

SLA targets: internal designers want sub-3-second SDXL renders and sub-6-second FLUX renders for interactive iteration; the customer API has a soft 10-second wall-clock budget. Both must run with ControlNet stacks for the brand-consistency LoRA bundle (3 LoRAs always loaded, 1-2 ControlNets common).

This workload runs on three 4090s in production: one dedicated to interactive ComfyUI sessions, two pooled behind a least-loaded balancer for the API. Total cost is roughly 18% of the equivalent SaaS image API spend at their volume.

Model lineup and trade-offs

| Model | Precision | VRAM | Best for | Latency 1024 b=1 |
|---|---|---|---|---|
| SDXL base + refiner | FP16 | ~10 GB | 1024-1536 photoreal, ControlNet | 2.7 s |
| SDXL Lightning 4-step | FP16 | ~9 GB | Interactive previews | 0.7 s |
| SDXL Turbo 4-step | FP16 | ~9 GB | Realtime sliders | 0.55 s |
| SD 1.5 | FP16 | ~4 GB | LoRA-heavy stylised work, huge batches | 0.45 s |
| FLUX.1-dev FP8 | FP8 | ~14 GB | Best-in-class fidelity | 4.1 s |
| FLUX.1-dev FP16 | FP16 | ~22 GB | Text-in-image, max quality | 6.2 s |
| FLUX.1-schnell FP8 | FP8 | ~14 GB | 4-step FLUX quality | 1.8 s |

The decision tree we recommend: SDXL for default 1024-pixel work, because its ControlNet and LoRA ecosystem is the most mature; FLUX.1-dev when you need state-of-the-art fidelity and ControlNet isn't required; FLUX.1-schnell for high-volume FLUX-quality serving; SD 1.5 only when you have brand-style LoRAs that don't transfer up to SDXL.
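
As a sketch, that tree collapses to a few lines of routing logic (the function and flag names below are illustrative, not from any library):

def choose_model(needs_controlnet: bool, max_fidelity: bool,
                 high_volume: bool, sd15_only_loras: bool) -> str:
    """Illustrative encoding of the decision tree above."""
    if sd15_only_loras:
        return "sd15"               # brand LoRAs that don't transfer to SDXL
    if needs_controlnet:
        return "sdxl"               # most mature ControlNet + LoRA ecosystem
    if max_fidelity:
        return "flux-dev-fp8"       # state-of-the-art fidelity, no ControlNet
    if high_volume:
        return "flux-schnell-fp8"   # 4-step FLUX quality at volume
    return "sdxl"                   # sensible default for 1024-pixel work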

Throughput benchmarks

Per-image wall-clock at default schedulers, Diffusers 0.30 with PyTorch 2.5, sdpa attention, on a stock 4090:

| Pipeline | Resolution | Steps | Latency b=1 | Latency b=4 | img/min batched |
|---|---|---|---|---|---|
| SDXL base | 1024×1024 | 30 (DPM++ 2M) | 2.0 s | 6.5 s | 37 |
| SDXL base + refiner | 1024×1024 | 30+10 | 2.7 s | 8.1 s | 30 |
| SDXL Lightning | 1024×1024 | 4 | 0.7 s | 1.6 s | 150 |
| SDXL Turbo | 1024×1024 | 4 | 0.55 s | 1.4 s | 171 |
| SD 1.5 | 768×768 | 30 | 0.95 s | 2.6 s | 92 |
| FLUX.1-dev FP8 | 1024×1024 | 30 (Euler) | 4.1 s | 12.4 s | 19 |
| FLUX.1-schnell FP8 | 1024×1024 | 4 | 1.8 s | 5.6 s | 43 |

SDXL base at 1.6 s/image batched gives 2,200 images/hour per card. FLUX.1-dev FP8 at 3.1 s/image batched gives 1,160/hour with state-of-the-art fidelity. FLUX.1-schnell FP8 at 1.4 s/image is the practical default for high-volume FLUX-quality serving. SD 1.5 at 0.65 s/image batched can hit 5,500 images/hour for stylised low-resolution work.

FLUX.1-dev in 24 GB

FLUX.1-dev’s 12B-parameter MMDiT transformer at FP16 plus the T5-XXL text encoder needs careful VRAM accounting on a 24 GB card. The trick is sequential CPU offload of T5: encode the prompt, free T5 to system RAM, then load the transformer for denoising. Peak GPU VRAM lands at ~22 GB, leaving a safe 2 GB margin.
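
A minimal sketch of the two-stage load with Diffusers' FluxPipeline (this variant frees T5 by dropping the text-encoder-only pipeline after encoding; caching T5 in system RAM for re-use follows the same shape, and the prompt and step counts are placeholders):

import gc
import torch
from diffusers import FluxPipeline

MODEL = "black-forest-labs/FLUX.1-dev"

# Stage 1: text encoders only on the GPU -- encode the prompt, then free T5
text_pipe = FluxPipeline.from_pretrained(
    MODEL, transformer=None, vae=None, torch_dtype=torch.bfloat16
).to("cuda")
with torch.no_grad():
    prompt_embeds, pooled_embeds, _ = text_pipe.encode_prompt(
        prompt="studio product shot, soft window light", prompt_2=None
    )
del text_pipe
gc.collect(); torch.cuda.empty_cache()

# Stage 2: transformer + VAE only -- denoise from the cached embeddings
denoise_pipe = FluxPipeline.from_pretrained(
    MODEL, text_encoder=None, text_encoder_2=None,
    tokenizer=None, tokenizer_2=None, torch_dtype=torch.bfloat16
).to("cuda")
image = denoise_pipe(
    prompt_embeds=prompt_embeds, pooled_prompt_embeds=pooled_embeds,
    num_inference_steps=30, height=1024, width=1024,
).images[0]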

FP8 quantisation via torchao changes the calculus completely. The transformer drops to 11.8 GB resident with zero offload, leaving 12 GB free for batching, ControlNet, or co-located smaller models. FP8 also runs 1.5x faster than BF16 on the 4090’s native fourth-generation FP8 tensor cores. Quality measurements across 300 prompts show CLIP-T -0.2 and LPIPS 0.061 vs FP16 — within seed-to-seed variance for everything except very small text inside images. The full FLUX setup guide documents the recipe.

ControlNet, LoRA and IP-Adapter stacks

Real studio work always involves stacking. The 4090 has enough VRAM to handle aggressive composition:

| Stack | SDXL latency b=1 | VRAM | Notes |
|---|---|---|---|
| SDXL base | 2.0 s | 9.0 GB | Baseline |
| + 4 LoRAs (PEFT merge) | 2.05 s | 10.0 GB | One-off load cost only |
| + ControlNet (Canny) | 2.6 s | 11.5 GB | +30% per step |
| + ControlNet (Canny + Depth) | 3.2 s | 13.5 GB | Two parallel paths |
| + IP-Adapter Plus | 2.4 s | 11.2 GB | Image conditioning |
| + Refiner | 2.7 s | 11.0 GB | 10-step refiner |
| Studio default: base + refiner + 4 LoRAs + 1 ControlNet | 3.6 s | 14.5 GB | Brand-consistent renders |
| Heavy stack: base + refiner + 4 LoRAs + 2 ControlNets + IP-Adapter | 4.4 s | 17.2 GB | Complex compositions |

LoRAs are nearly free per-image because PEFT merges low-rank deltas into the UNet weights at load time. ControlNet, by contrast, runs a parallel 1.3B network at every step — the +30% per ControlNet is unavoidable. The 4090’s 24GB headroom lets you stack all of this without OOM; on a 16GB card you would have to choose between batching and stacking.
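
A sketch of the studio-default load order with Diffusers and PEFT. The ControlNet checkpoint is the public diffusers/controlnet-canny-sdxl-1.0; the LoRA repo names are placeholders for a brand bundle, not real checkpoints:

import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

# ControlNet runs as a parallel network at every denoising step: +~30% latency
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet, torch_dtype=torch.float16,
).to("cuda")

# LoRAs: load once in the canonical order, set scales, then fuse so the
# low-rank deltas merge into the UNet weights (near-zero per-image cost)
pipe.load_lora_weights("acme/brand-style-sdxl", adapter_name="style")      # placeholder repo
pipe.load_lora_weights("acme/brand-palette-sdxl", adapter_name="palette")  # placeholder repo
pipe.set_adapters(["style", "palette"], adapter_weights=[0.8, 0.6])
pipe.fuse_lora()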

Studio capacity and scaling triggers

| Workload | Per-4090 capacity | Notes |
|---|---|---|
| SDXL 1024 photoreal (default stack) | 1,800 images/hour | Batch 4 with ControlNet + LoRAs |
| FLUX.1-dev FP8 premium | 1,160 images/hour | Batch 4, no ControlNet |
| FLUX.1-schnell FP8 high-volume | 2,560 images/hour | Batch 4, 4-step |
| SD 1.5 brand-style with 5 LoRAs | 5,500 images/hour | Batch 8 at 768px |
| SDXL Turbo realtime | 10,200 images/hour | Batch 4, 4-step |
| Concurrent designers (interactive) | 4-6 active sessions | Mixed pipeline |

Scaling triggers for the named 12-engineer studio:

  • Add a card at 5,000+ daily renders sustained. One 4090 covers ~30,000 SDXL/day comfortably; beyond that latency starts breaking SLA.
  • Split FLUX traffic to a dedicated card at 600+ FLUX renders/day. Mixing FLUX and SDXL on one card incurs cold-swap penalties (8-12 seconds) that hurt interactive UX.
  • Promote to 5090 32GB when stacking heavy ControlNet on FLUX. The extra 8GB matters for FLUX + 2x ControlNet, which is otherwise infeasible.
  • Add a small CPU-only Postgres/Redis box for prompt history and LoRA metadata. Don’t co-locate on the GPU host; it competes for PCIe and disk I/O.

Multi-pipeline serving config

The pattern that works for studios mixing models: two pre-loaded pipelines per card behind a shared scheduler. Below is the FLUX FP8 load with torchao quantisation that keeps ~14GB resident and leaves room for an SDXL pipeline alongside:

from diffusers import FluxPipeline, StableDiffusionXLPipeline
import torch
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

# FLUX FP8: 14GB resident
flux = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
quantize_(flux.transformer, float8_dynamic_activation_float8_weight())  # FP8 on the 4090's tensor cores
flux.vae.to(torch.float16)  # keep the VAE out of FP8 to avoid colour drift

# SDXL: ~9GB resident, co-loaded
sdxl = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Total ~23GB; route by request type
def render(model_name, prompt, **kw):
    pipe = flux if model_name == "flux" else sdxl
    return pipe(prompt=prompt, **kw).images[0]

This consumes ~23GB resident: tight, but workable for a 4090 dedicated to mixed traffic. Full setup recipes are in the FLUX setup and Stable Diffusion setup guides. For the API tier, separate cards by model class and load-balance externally, as sketched below.
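
The external balancer does not need to be clever. A least-loaded picker is enough (the worker URLs and in-flight bookkeeping below are illustrative assumptions, not part of the config above):

import threading

class LeastLoadedBalancer:
    """Route each render to the worker with the fewest in-flight jobs."""

    def __init__(self, workers):
        self.lock = threading.Lock()
        self.in_flight = {w: 0 for w in workers}

    def acquire(self):
        with self.lock:
            worker = min(self.in_flight, key=self.in_flight.get)
            self.in_flight[worker] += 1
            return worker

    def release(self, worker):
        with self.lock:
            self.in_flight[worker] -= 1

# Two pooled API cards, per the named workload (hypothetical endpoints)
balancer = LeastLoadedBalancer(["http://gpu-api-1:8000", "http://gpu-api-2:8000"])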

Production gotchas

  • VAE FP16 colour bug on SDXL. The default VAE produces saturated artefacts in FP16. Use the madebyollin/sdxl-vae-fp16-fix checkpoint (see the sketch after this list) or run the VAE in FP32 (+150 ms).
  • FLUX VAE in FP8 produces colour drift. Always keep the VAE in FP16 even when the transformer is FP8.
  • ComfyUI workflows leak VRAM. Custom-node workflows can hold references to old pipelines; restart the worker every 200-500 images depending on workflow complexity.
  • LoRA scale stacking is order-dependent. Loading LoRA A at 0.8 then B at 0.6 produces different output than B then A. Document the canonical order.
  • ControlNet preprocessor on CPU is the bottleneck. Canny and depth preprocessing in PIL/OpenCV often takes 200-400ms. Move to GPU via controlnet-aux or run async ahead of generation.
  • Don’t co-locate ComfyUI and an LLM on the same 4090. Designer workflows have unpredictable VRAM spikes that will OOM the LLM.
  • NSFW filters add 80-150ms per image. If you run safety classification, batch-classify post-VAE rather than per-step.
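
For the first gotcha, swapping in the fixed VAE is a one-line change at pipeline load time. A minimal sketch using the madebyollin/sdxl-vae-fp16-fix checkpoint:

import torch
from diffusers import AutoencoderKL, StableDiffusionXLPipeline

# FP16-safe VAE: avoids the saturated-colour artefacts of the stock SDXL VAE
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae, torch_dtype=torch.float16,
).to("cuda")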

Verdict: when to pick a 4090 for a studio

Pick the RTX 4090 24GB for a creative image studio when you need the full diffusion stack on one card with comfortable headroom: SDXL with stacks, FLUX FP8, and SD 1.5. The named 12-engineer workload runs on three 4090s at daily volumes that would otherwise cost £8,000+/month from a SaaS provider. Step down to the 5060 Ti 16GB only for solo creators or low-volume API endpoints; you lose the FLUX.1-dev fit and meaningful batching. Step up to the 5090 32GB when FLUX + ControlNet is your main workload or when you want batch 8 at native 1024 SDXL. See best GPU for Stable Diffusion for the broader landscape.

Run SDXL and FLUX on one card

2,200 SDXL images per hour, FLUX.1-dev FP8 fits comfortably, ControlNet and LoRAs stack. UK dedicated hosting, predictable monthly cost.

Order the RTX 4090 24GB

See also: FLUX setup, Stable Diffusion setup, ComfyUI setup, SDXL benchmark, FLUX dev benchmark, FLUX schnell benchmark, Stable Video Diffusion, best GPU for SD, 4090 spec breakdown.
