The RTX 4090 24GB is the sweet spot for serious diffusion work outside data-centre cards. It runs SDXL at 1.6 s per 1024-pixel image when batched, fits FLUX.1-dev FP8 at 4.1 s, and serves SD 1.5 with ControlNet and a stack of LoRAs, all from one UK GPU host. This post covers the named workload (a 12-engineer creative team running an internal image studio plus a public-facing API), the model lineup, capacity numbers and the production gotchas we have seen across two years of deployments.
Contents
- Named workload: 12-engineer studio
- Model lineup and trade-offs
- Throughput benchmarks
- FLUX.1-dev in 24 GB
- ControlNet, LoRA and IP-Adapter stacks
- Studio capacity and scaling triggers
- Multi-pipeline serving config
- Production gotchas
- Verdict: when to pick a 4090 for a studio
Named workload: 12-engineer studio
Reference workload: a UK product design agency with 12 creative engineers running an internal image studio (peak 4-6 concurrent designers iterating in ComfyUI), plus a public-facing customer endpoint exposing SDXL and FLUX schnell behind their own brand. Average daily image volume over the last quarter: 14,800 internal renders and 38,400 customer-facing API renders. Peak observed hourly burst: 2,400 images.
SLA targets: internal designers want sub-3-second SDXL renders and sub-6-second FLUX renders for interactive iteration; the customer API has a soft 10-second wall-clock budget. Both must run with ControlNet stacks for the brand-consistency LoRA bundle (3 LoRAs always loaded, 1-2 ControlNets common).
This workload runs on three 4090s in production: one dedicated to interactive ComfyUI sessions, two pooled behind a least-loaded balancer for the API. Total cost is roughly 18% of the equivalent SaaS image API spend at their volume.
Model lineup and trade-offs
| Model | Precision | VRAM | Best for | Latency 1024 b=1 |
|---|---|---|---|---|
| SDXL base + refiner | FP16 | ~10 GB | 1024-1536 photoreal, ControlNet | 2.7 s |
| SDXL Lightning 4-step | FP16 | ~9 GB | Interactive previews | 0.7 s |
| SDXL Turbo 4-step | FP16 | ~9 GB | Realtime sliders | 0.55 s |
| SD 1.5 | FP16 | ~4 GB | LoRA-heavy stylised, batch huge | 0.45 s |
| FLUX.1-dev FP8 | FP8 | ~14 GB | Best-in-class fidelity | 4.1 s |
| FLUX.1-dev FP16 | FP16 | ~22 GB | Text-in-image, max quality | 6.2 s |
| FLUX.1-schnell FP8 | FP8 | ~14 GB | 4-step FLUX quality | 1.8 s |
The decision tree we recommend: SDXL for default 1024-pixel work because of its mature ControlNet and LoRA ecosystem; FLUX.1-dev when you need state-of-the-art fidelity and ControlNet isn't required; FLUX schnell for high-volume FLUX-quality serving; SD 1.5 only when you have brand-style LoRAs that don't transfer up to SDXL.
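For request routing, that tree reduces to a few lines. A toy encoding follows; this is our illustration, not a library API, and the model keys are arbitrary labels:

def pick_model(needs_controlnet: bool, max_fidelity: bool,
               high_volume: bool, sd15_brand_loras: bool) -> str:
    # Toy encoding of the recommendation above; keys are arbitrary labels.
    if sd15_brand_loras:
        return "sd-1.5"              # brand LoRAs that don't transfer to SDXL
    if needs_controlnet:
        return "sdxl"                # mature ControlNet/LoRA ecosystem
    if max_fidelity:
        return "flux.1-dev-fp8"      # state-of-the-art fidelity, no ControlNet
    if high_volume:
        return "flux.1-schnell-fp8"  # 4-step FLUX quality at volume
    return "sdxl"                    # safe default for 1024px work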
Throughput benchmarks
Per-image wall-clock times with default schedulers, measured on a stock 4090 with Diffusers 0.30, PyTorch 2.5 and SDPA attention:
| Pipeline | Resolution | Steps | Latency b=1 | Latency b=4 | img/min batched |
|---|---|---|---|---|---|
| SDXL base | 1024×1024 | 30 (DPM++ 2M) | 2.0 s | 6.5 s | 37 |
| SDXL base + refiner | 1024×1024 | 30+10 | 2.7 s | 8.1 s | 30 |
| SDXL Lightning | 1024×1024 | 4 | 0.7 s | 1.6 s | 150 |
| SDXL Turbo | 1024×1024 | 4 | 0.55 s | 1.4 s | 171 |
| SD 1.5 | 768×768 | 30 | 0.95 s | 2.6 s | 92 |
| FLUX.1-dev FP8 | 1024×1024 | 30 (Euler) | 4.1 s | 12.4 s | 19 |
| FLUX.1-schnell FP8 | 1024×1024 | 4 | 1.8 s | 5.6 s | 43 |
SDXL base at 1.6 s/image batched gives 2,200 images/hour per card. FLUX.1-dev FP8 at 3.1 s/image batched gives 1,160/hour with state-of-the-art fidelity. FLUX.1-schnell FP8 at 1.4 s/image is the practical default for high-volume FLUX-quality serving. SD 1.5 at 0.65 s/image batched can hit 5,500 images/hour for stylised low-resolution work.
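To reproduce the SDXL rows, a minimal timing harness along these lines works. This is a sketch, not the exact script behind the table:

import time
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

def bench(prompt: str, batch: int = 4, steps: int = 30, runs: int = 5) -> float:
    # warm-up run to compile kernels and fill caches
    pipe(prompt=prompt, num_inference_steps=steps, num_images_per_prompt=batch)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(runs):
        pipe(prompt=prompt, num_inference_steps=steps, num_images_per_prompt=batch)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / (runs * batch)  # seconds per image

print(f"{bench('a product photo, studio lighting'):.2f} s/image")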
FLUX.1-dev in 24 GB
FLUX.1-dev’s 12B-parameter MMDiT transformer at FP16 plus the T5-XXL text encoder needs careful VRAM accounting on a 24 GB card. The trick is sequential CPU offload of T5: encode the prompt, free T5 to system RAM, then load the transformer for denoising. Peak GPU VRAM lands at ~22 GB, leaving a safe 2 GB margin.
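A minimal sketch of that offload pattern with stock Diffusers follows. Module names are as in FluxPipeline; some Diffusers versions resolve the pipeline device from the first attached module, hence detaching T5 rather than just moving it:

import torch
from diffusers import FluxPipeline

# Load on CPU so nothing touches VRAM until we choose to move it
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
prompt = "product render, studio lighting"

# 1. Encode with only the text encoders resident on the GPU
pipe.text_encoder.to("cuda")    # CLIP, small
pipe.text_encoder_2.to("cuda")  # T5-XXL, ~9 GB in BF16
prompt_embeds, pooled_prompt_embeds, _ = pipe.encode_prompt(
    prompt=prompt, prompt_2=prompt, device="cuda"
)

# 2. Free T5 back to system RAM before the transformer comes in
t5 = pipe.text_encoder_2.to("cpu")  # keep it around for the next prompt
pipe.text_encoder_2 = None          # detach so device resolution ignores it
torch.cuda.empty_cache()

# 3. Denoise and decode with the 12B transformer resident
pipe.transformer.to("cuda")
pipe.vae.to("cuda")
image = pipe(
    prompt_embeds=prompt_embeds, pooled_prompt_embeds=pooled_prompt_embeds
).images[0]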
FP8 quantisation via torchao changes the calculus completely. The transformer drops to 11.8 GB resident with zero offload, leaving ~12 GB free for batching, ControlNet, or co-located smaller models. FP8 also runs 1.5x faster than BF16 on the 4090's fourth-generation tensor cores, which support FP8 natively. Quality measurements across 300 prompts show CLIP-T down 0.2 and LPIPS of 0.061 vs FP16, within seed-to-seed variance for everything except very small text inside images. The full FLUX setup guide documents the recipe.
ControlNet, LoRA and IP-Adapter stacks
Real studio work always involves stacking. The 4090 has enough VRAM to handle aggressive stacks:
| Stack | SDXL latency b=1 | VRAM | Notes |
|---|---|---|---|
| SDXL base | 2.0 s | 9.0 GB | Baseline |
| + 4 LoRAs (PEFT merge) | 2.05 s | 10.0 GB | One-off load cost only |
| + ControlNet (Canny) | 2.6 s | 11.5 GB | +30% per step |
| + ControlNet (Canny + Depth) | 3.2 s | 13.5 GB | Two parallel paths |
| + IP-Adapter Plus | 2.4 s | 11.2 GB | Image conditioning |
| + Refiner | 2.7 s | 11.0 GB | 10-step refiner |
| Studio default: base + refiner + 4 LoRAs + 1 ControlNet | 3.6 s | 14.5 GB | Brand-consistent renders |
| Heavy stack: base + refiner + 4 LoRAs + 2 ControlNets + IP-Adapter | 4.4 s | 17.2 GB | Complex compositions |
LoRAs are nearly free per-image because PEFT merges the low-rank deltas into the UNet weights at load time. ControlNet, by contrast, runs a parallel 1.3B network at every step, so the +30% per ControlNet is unavoidable. The 4090's 24GB headroom lets you stack all of this without OOM; on a 16GB card you would have to choose between batching and stacking.
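Loading the studio default stack in Diffusers looks roughly like this; the LoRA repo names are placeholders for the brand bundle, and canny_map is a preprocessed conditioning image you supply:

import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Hypothetical brand LoRAs: substitute your own repos. Load order matters
# when scales stack (see gotchas), so keep it canonical.
pipe.load_lora_weights("your-org/brand-style-lora", adapter_name="style")
pipe.load_lora_weights("your-org/brand-palette-lora", adapter_name="palette")
pipe.set_adapters(["style", "palette"], adapter_weights=[0.8, 0.6])
pipe.fuse_lora()  # merge the deltas into the UNet: near-zero per-image cost

image = pipe(prompt="hero shot", image=canny_map).images[0]  # canny_map: PIL image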
Studio capacity and scaling triggers
| Workload | Per-4090 capacity | Notes |
|---|---|---|
| SDXL 1024 photoreal (default stack) | 1,800 images/hour | Batch 4 with ControlNet + LoRAs |
| FLUX.1-dev FP8 premium | 1,160 images/hour | Batch 4, no ControlNet |
| FLUX.1-schnell FP8 high-volume | 2,560 images/hour | Batch 4, 4-step |
| SD 1.5 brand-style with 5 LoRAs | 5,500 images/hour | Batch 8 at 768px |
| SDXL Turbo realtime | 10,200 images/hour | Batch 4, 4-step |
| Concurrent designers (interactive) | 4-6 active sessions | Mixed pipeline |
Scaling triggers for the named 12-engineer studio (a toy encoding of the thresholds follows the list):
- Add a card when sustained volume approaches ~30,000 SDXL renders/day per card. Beyond that, queue latency starts breaking the SLA.
- Split FLUX traffic to a dedicated card at 600+ FLUX renders/day. Mixed FLUX/SDXL on one card means cold-swap penalties (8-12 seconds) hurt interactive UX.
- Promote to 5090 32GB when stacking heavy ControlNet on FLUX. The extra 8GB matters for FLUX + 2x ControlNet, which is otherwise infeasible.
- Add a small CPU-only Postgres/Redis box for prompt history and LoRA metadata. Don’t co-locate on the GPU host; it competes for PCIe and disk I/O.
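A small helper encoding those thresholds, useful as a capacity-dashboard check. This is our sketch; the FLUX figure in the example is hypothetical:

def scaling_actions(daily_renders: int, daily_flux: int, cards: int) -> list[str]:
    # Thresholds taken from the scaling triggers above
    actions = []
    if daily_renders / cards > 30_000:
        actions.append("add a 4090 to the pool")
    if daily_flux > 600:
        actions.append("move FLUX to a dedicated card")
    return actions

# The named studio: ~53,200 renders/day on three cards, FLUX share hypothetical
print(scaling_actions(53_200, 800, 3))  # ['move FLUX to a dedicated card']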
Multi-pipeline serving config
The pattern that works for studios mixing models: two pre-loaded pipelines per card, routed by a shared scheduler. Below is the FLUX FP8 load with torchao quantisation that keeps ~14GB resident and leaves room for an SDXL pipeline alongside:
from diffusers import FluxPipeline, StableDiffusionXLPipeline
import torch
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight
# FLUX FP8: 14GB resident
flux = FluxPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
quantize_(flux.transformer, float8_dynamic_activation_float8_weight())
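# keep the VAE out of FP8: an FP8 VAE causes colour drift (see gotchas below)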
flux.vae.to(torch.float16)
# SDXL: ~9GB resident, co-loaded
sdxl = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
# Total ~23GB; route by request type
def render(model_name, prompt, **kw):
    pipe = flux if model_name == "flux" else sdxl
    return pipe(prompt=prompt, **kw).images[0]
This consumes ~23GB resident, tight but workable for a 4090 dedicated to mixed traffic. For the API tier, separate cards by model class and load-balance externally. Full setup recipes are in FLUX setup and Stable Diffusion setup.
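For the two-card API pool, least-loaded routing can be as simple as an in-process counter per worker. A sketch assuming one HTTP worker per GPU; the worker URLs are placeholders:

import threading

class LeastLoadedRouter:
    def __init__(self, workers: list[str]):
        self.load = {w: 0 for w in workers}
        self.lock = threading.Lock()

    def acquire(self) -> str:
        # pick the worker with the fewest in-flight renders
        with self.lock:
            worker = min(self.load, key=self.load.get)
            self.load[worker] += 1
            return worker

    def release(self, worker: str) -> None:
        with self.lock:
            self.load[worker] -= 1

router = LeastLoadedRouter(["http://gpu-a:8188", "http://gpu-b:8188"])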
Production gotchas
- VAE FP16 colour bug on SDXL. The default VAE produces saturated artefacts in FP16. Use the madebyollin/sdxl-vae-fp16-fix checkpoint or run the VAE in FP32 (+150 ms).
- FLUX VAE in FP8 produces colour drift. Always keep the VAE in FP16 even when the transformer is FP8.
- ComfyUI workflows leak VRAM. Custom-node workflows can hold references to old pipelines; restart the worker every 200-500 images depending on workflow complexity.
- LoRA scale stacking is order-dependent. Loading LoRA A at 0.8 then B at 0.6 produces different output than B then A. Document the canonical order.
- ControlNet preprocessor on CPU is the bottleneck. Canny and depth preprocessing in PIL/OpenCV often takes 200-400ms. Move it to GPU via controlnet-aux or run it async ahead of generation (a sketch follows this list).
- Don’t co-locate ComfyUI and an LLM on the same 4090. Designer workflows have unpredictable VRAM spikes that will OOM the LLM.
- NSFW filters add 80-150ms per image. If you run safety classification, batch-classify post-VAE rather than per-step.
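The async preprocessing variant from the ControlNet gotcha, sketched with controlnet-aux and a thread pool; variable names are ours, and pipe is the ControlNet pipeline from earlier:

from concurrent.futures import ThreadPoolExecutor
from controlnet_aux import CannyDetector

canny = CannyDetector()
pool = ThreadPoolExecutor(max_workers=2)

def queue_job(source_image, prompt):
    # start edge extraction while the GPU is still busy with the previous render
    control_future = pool.submit(canny, source_image)
    return prompt, control_future

prompt, control_future = queue_job(source_image, "hero shot")  # source_image: PIL.Image
control = control_future.result()  # usually ready before the GPU frees up
image = pipe(prompt=prompt, image=control).images[0]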
Verdict: when to pick a 4090 for a studio
Pick the RTX 4090 24GB for a creative image studio when you need the full diffusion stack (SDXL with ControlNet and LoRA stacks, FLUX FP8, and SD 1.5) on one card with comfortable headroom. The named 12-engineer workload runs three 4090s for daily volumes the team would otherwise pay £8,000+/month to a SaaS provider for. Step down to the 5060 Ti 16GB only for solo creators or low-volume API endpoints; you lose the FLUX.1-dev fit and meaningful batching. Step up to the 5090 32GB when FLUX + ControlNet is your main workload or when you want batch 8 at native 1024 SDXL. See best GPU for Stable Diffusion for the broader landscape.
Run SDXL and FLUX on one card
2,200 SDXL images per hour, FLUX.1-dev FP8 fits comfortably, ControlNet and LoRAs stack. UK dedicated hosting, predictable monthly cost.
Order the RTX 4090 24GB
See also: FLUX setup, Stable Diffusion setup, ComfyUI setup, SDXL benchmark, FLUX dev benchmark, FLUX schnell benchmark, Stable Video Diffusion, best GPU for SD, 4090 spec breakdown.