FLUX.1-schnell from Black Forest Labs is the Apache 2.0-licensed, 4-step distilled variant of the FLUX.1 family — the de facto standard for fast, licence-clean image generation in 2026. On a single RTX 4090 24GB through Gigagpu’s UK dedicated hosting, FP8-quantised schnell renders a 1024×1024 image in 1.8 seconds at 4 steps. FP16 lands at 2.6 seconds. This page contains the full benchmark suite — per-step latency, batch throughput, VRAM headroom, FP8 vs FP16 quality differential, and the production launch config.
Contents
- Test rig and methodology
- Latency per step and total
- Batched generation throughput
- VRAM breakdown
- FP16 vs FP8 quality
- vs SDXL, SDXL-Turbo, FLUX.1-dev
- Production deployment
- Gotchas
Test rig and methodology
- GPU: RTX 4090 24GB Founders Edition, 450W stock TDP, no undervolt
- Host: Ryzen 9 7950X, 64 GB DDR5-5600, Samsung 990 Pro 2 TB Gen 4 NVMe
- OS: Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6
- Stack: Diffusers 0.30, PyTorch 2.5, FlashAttention 2.6, torchao 0.5 (for FP8 transformer quant), xFormers 0.0.28
- Resolution: 1024×1024 unless noted; 16-channel VAE in FP16; T5-XXL text encoder offloaded after first prompt
- Prompts: 200-prompt set covering portrait, landscape, abstract, typography
- Each measurement: 5-image warmup, then 25-image timed batch, median reported
Benchmarks run with the standard FluxPipeline; FP8 weights produced via torchao’s float8 dynamic activation/weight quantisation on the transformer (T5 stays bf16, VAE stays FP16 to avoid colour artefacts).
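For reference, the timing loop looks roughly like the sketch below — a minimal harness, assuming `pipe` is the FluxPipeline configured as in the deployment section; the function name and prompt handling are illustrative:

```python
import statistics
import time

import torch

def bench(pipe, prompts, steps=4, warmup=5, timed=25):
    # Warm-up: first T5 encode, CUDA context init, kernel autotuning
    for p in prompts[:warmup]:
        pipe(prompt=p, num_inference_steps=steps, guidance_scale=0.0)
    times = []
    for p in prompts[warmup:warmup + timed]:
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        pipe(prompt=p, num_inference_steps=steps, guidance_scale=0.0)
        torch.cuda.synchronize()        # wait for the GPU before stopping the clock
        times.append(time.perf_counter() - t0)
    return statistics.median(times)     # median over the 25 timed images
```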
Latency per step and total
| Steps | FP16 latency | FP16 ms/step | FP8 latency | FP8 ms/step |
|---|---|---|---|---|
| 1 | 0.85 s | 850 | 0.62 s | 620 |
| 2 | 1.45 s | 725 | 1.05 s | 525 |
| 4 (recommended) | 2.60 s | 650 | 1.80 s | 450 |
| 8 | 4.95 s | 619 | 3.40 s | 425 |
The schnell distillation targets 4 steps; quality plateaus past 4 and adding more steps mainly adds variance. Per-step latency drops as the per-image fixed cost (text encode, VAE decode, scheduler init) amortises over more denoising iterations.
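Backing the fixed cost out of the FP8 column makes the amortisation concrete — a back-of-envelope fit to the table above, not a separate measurement:

```python
# Model total latency as fixed + steps * per_step, fitted from the
# 4- and 8-step FP8 rows of the table above.
fp8_per_step = (3.40 - 1.80) / (8 - 4)   # 0.40 s true marginal step cost
fp8_fixed = 1.80 - 4 * fp8_per_step      # ~0.20 s encode/decode/scheduler
# Sanity check: predicts 0.60 s at 1 step vs the measured 0.62 s.
```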
Batched generation throughput
| Batch | Batch total (FP8, 4 steps) | s/image | VRAM peak | images/min |
|---|---|---|---|---|
| 1 | 1.80 s | 1.80 | 9.2 GB | 33 |
| 2 | 3.20 s | 1.60 | 10.4 GB | 37 |
| 4 | 5.80 s | 1.45 | 13.0 GB | 41 |
| 6 | 8.50 s | 1.42 | 15.8 GB | 42 |
| 8 | 11.40 s | 1.43 | 18.6 GB | 42 |
| 12 | 17.30 s | 1.44 | 22.8 GB (tight) | 42 |
Throughput plateaus around 42 images/minute at batch 6+ — the transformer becomes compute-saturated and there’s no further gain from batching. For an interactive image-studio workload, batch 1 at 1.8s is the right default; for bulk batch generation (thumbnail farms, product catalogues), batch 6 at 1.42s/image is the sweet spot.
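For the bulk path, batching is just a list of prompts to the same pipeline; a minimal sketch at the batch-6 sweet spot (the prompt is illustrative):

```python
# Six prompts in one forward pass: ~8.5 s total, ~1.42 s/image at 4 steps
prompts = ["product photo of a ceramic mug, studio lighting"] * 6
images = pipe(
    prompt=prompts,
    num_inference_steps=4,
    guidance_scale=0.0,
    height=1024, width=1024,
).images
for i, img in enumerate(images):
    img.save(f"mug_{i:02d}.png")
```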
VRAM breakdown
| Component | FP16 | FP8 |
|---|---|---|
| Transformer (12B params) | 23.5 GB → CPU offload required | 11.8 GB resident |
| Active transformer block (FP16, sequential offload) | ~12 GB resident peak | n/a |
| T5-XXL text encoder | 9.2 GB on first prompt, then CPU | 9.2 GB → CPU after first encode |
| CLIP-L text encoder | 0.4 GB resident | 0.4 GB |
| VAE (FP16) | 0.3 GB | 0.3 GB |
| Activations (1024px, 4 steps) | 1.8 GB | 1.4 GB |
| Scratch / kernel workspace | 0.6 GB | 0.6 GB |
| Peak GPU VRAM | ~14.5 GB | ~9.2 GB |
FP8 keeps everything resident with ample headroom on a 24 GB card; FP16 needs sequential CPU offload of the T5 encoder, adding ~250 ms first-prompt overhead but no per-step penalty after the first encode. This is what the VRAM bandwidth analysis predicts.
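For the FP16 path, stock Diffusers can reproduce that offload behaviour. A sketch — the exact offload policy behind the table may differ, and `enable_sequential_cpu_offload()` is the tighter-VRAM but slower alternative:

```python
# bf16/FP16 on a 24 GB card: let Diffusers move idle sub-models to CPU.
# enable_model_cpu_offload() shuttles whole components (T5-XXL, transformer,
# VAE) to system RAM when not in use; do not also call .to("cuda").
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
```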
FP16 vs FP8 quality
Across the 200-prompt evaluation set, scored against the FP16 reference:
| Metric | FP16 reference | FP8 build | Delta |
|---|---|---|---|
| CLIP-T (prompt adherence) | 31.6 | 31.4 | −0.2 (within seed variance) |
| LPIPS-vs-FP16 (perceptual diff) | — | 0.058 | imperceptible |
| FID-1k (on COCO captions) | 26.4 | 26.7 | +0.3 |
| Hand / typography failures | 9% of prompts | 10% of prompts | +1 pt (negligible) |
Differences sit well inside seed-to-seed variance for a single prompt. For production work, where the 0.8s saved per image compounds across thousands of images a day, FP8 is the right default. The 4090’s tensor cores execute FP8 natively, so there is no emulation penalty: every FP8 operation runs in silicon.
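The LPIPS figure is straightforward to reproduce on your own prompts. A sketch assuming the `lpips` package and two same-seed outputs — `img_fp16` and `img_fp8` are placeholder names:

```python
import lpips   # pip install lpips
import numpy as np
import torch

loss_fn = lpips.LPIPS(net="alex")

def to_lpips_tensor(img):
    # PIL image -> 1x3xHxW float tensor scaled to [-1, 1], as LPIPS expects
    x = torch.from_numpy(np.asarray(img)).permute(2, 0, 1).float()
    return (x / 127.5 - 1.0).unsqueeze(0)

# Generate both images with the same generator seed before comparing
d = loss_fn(to_lpips_tensor(img_fp16), to_lpips_tensor(img_fp8)).item()
print(f"LPIPS: {d:.3f}")   # averages ~0.058 over the 200-prompt set
```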
vs SDXL, SDXL-Turbo, FLUX.1-dev
| Model | Steps | 4090 latency | VRAM | Prompt adherence | Licence |
|---|---|---|---|---|---|
| FLUX.1-schnell FP8 | 4 | 1.8 s | 9.2 GB | 87.6 | Apache 2.0 |
| FLUX.1-schnell FP16 | 4 | 2.6 s | 14.5 GB | 87.8 | Apache 2.0 |
| FLUX.1-dev FP8 | 30 | 4.0 s | 14.0 GB | 89.4 | FLUX.1 Non-Commercial |
| FLUX.1-dev FP16 | 30 | 6.2 s | 22.0 GB (tight) | 89.6 | FLUX.1 Non-Commercial |
| SDXL FP16 | 30 | 2.0 s | 9.0 GB | 78.2 | OpenRAIL++ |
| SDXL-Turbo | 4 | 0.55 s | 9.0 GB | 72.0 | SAI Community |
| SDXL-Lightning | 4 | 0.70 s | 9.0 GB | 74.5 | SAI Community |
FLUX.1-schnell occupies a clear sweet spot: better prompt adherence than SDXL/Turbo and meaningful licence freedom (Apache 2.0 vs FLUX.1-dev’s Non-Commercial), at a latency only 1.1s behind SDXL-Lightning. For commercial workloads, schnell is the right default. See the full FLUX.1-dev benchmark for when stepping up to dev makes sense, and the SDXL benchmark for the older workflow.
Production deployment
Standard Diffusers launch with FP8 quantisation:
```python
import torch
from diffusers import FluxPipeline
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
).to("cuda")

# FP8-quantise the transformer (12B params, the bottleneck)
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())

# Keep the VAE in FP16 to avoid colour artefacts
pipe.vae.to(torch.float16)

image = pipe(
    prompt="a fox in a victorian library, oil painting",
    num_inference_steps=4,
    guidance_scale=0.0,          # schnell distillation has CFG baked in
    height=1024, width=1024,
    max_sequence_length=256,     # schnell was trained with a 256-token T5 context
).images[0]
image.save("fox.png")
```
For high-throughput deployments, wrap this in a FastAPI server with a request queue and pre-encode commonly-repeated prompts. With prompt encoder caching and batch=4 the server sustains ~40 images/min single-card.
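A sketch of the embedding cache — function names are illustrative; it assumes `FluxPipeline.encode_prompt`, which returns the T5 embeddings, the pooled CLIP embeddings, and text IDs:

```python
from functools import lru_cache

import torch

@lru_cache(maxsize=1024)
def encode(prompt: str):
    # One T5-XXL + CLIP pass per distinct prompt, then served from cache
    with torch.no_grad():
        prompt_embeds, pooled_embeds, _ = pipe.encode_prompt(
            prompt=prompt, prompt_2=prompt, max_sequence_length=256
        )
    return prompt_embeds, pooled_embeds

def generate(prompt: str):
    prompt_embeds, pooled_embeds = encode(prompt)
    return pipe(
        prompt_embeds=prompt_embeds,
        pooled_prompt_embeds=pooled_embeds,
        num_inference_steps=4,
        guidance_scale=0.0,
        height=1024, width=1024,
    ).images[0]
```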
Gotchas
- Guidance scale 0.0 for schnell. The distillation has CFG baked in — passing guidance_scale > 1 produces over-saturated outputs.
- Don’t FP8 the VAE. The VAE is small and FP8 quant of its weights causes colour banding. Keep it FP16.
- T5 encode is slow on first prompt. ~9s for the first prompt warm-up; cache the text embeddings for repeat prompts.
- Sequential CPU offload only for FP16. Don’t enable it for FP8 — adds 250ms with no VRAM benefit.
- torch.compile gives ~10% but takes 90s to JIT. Use it only for long-running services, not interactive notebooks; see the sketch after this list.
- Apache 2.0 licence. schnell is commercially usable; FLUX.1-dev is Non-Commercial. Check the model card before shipping.
- Compare to the 5060 Ti FLUX schnell benchmark if you don’t need the 4090’s headroom — the Blackwell 16GB card runs schnell at 2.4s FP8.
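For the torch.compile point above, a minimal sketch; the compile mode is a starting point, not a tested config:

```python
# ~10% faster steady-state; the first call pays the ~90 s compile
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune")
```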
FLUX.1-schnell at 1.8s per Image
Apache 2.0 licensed, FP8 quantised, single-card serving. UK dedicated hosting.
Order the RTX 4090 24GB
See also: FLUX.1-dev benchmark, SDXL benchmark, FLUX setup guide, ComfyUI setup, SD setup, image studio use case, spec breakdown.