
RTX 4090 24GB FLUX.1-schnell Benchmark

Full FLUX.1-schnell benchmark on the RTX 4090 24GB - 1.8s per 1024px image at FP8, batch throughput, FP16 vs FP8 quality, comparison to SDXL Turbo, and the production deployment recipe.

FLUX.1-schnell from Black Forest Labs is the Apache 2.0-licensed, 4-step distilled variant of the FLUX.1 family — the de facto standard for fast, license-clean image generation in 2026. On a single RTX 4090 24GB through Gigagpu’s UK dedicated hosting, FP8-quantised schnell renders a 1024×1024 image in 1.8 seconds at 4 steps. FP16 lands at 2.6 seconds. This page contains the full benchmark suite — per-step latency, batch throughput, VRAM headroom, FP8 vs FP16 quality differential, and the production launch config.

Test rig and methodology

  • GPU: RTX 4090 24GB Founders Edition, 450W stock TDP, no undervolt
  • Host: Ryzen 9 7950X, 64 GB DDR5-5600, Samsung 990 Pro 2 TB Gen 4 NVMe
  • OS: Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6
  • Stack: Diffusers 0.30, PyTorch 2.5, FlashAttention 2.6, torchao 0.5 (for FP8 transformer quant), xFormers 0.0.28
  • Resolution: 1024×1024 unless noted; 16-channel VAE in FP16; T5-XXL text encoder offloaded after first prompt
  • Prompts: 200-prompt set covering portrait, landscape, abstract, typography
  • Each measurement: 5-image warmup, then 25-image timed batch, median reported

Benchmarks run with the standard FluxPipeline; FP8 weights produced via torchao’s float8 dynamic activation/weight quantisation on the transformer (T5 stays bf16, VAE stays FP16 to avoid colour artefacts).
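The warmup-then-median protocol above can be sketched as a small harness. This is a generic sketch, not the exact script used for the tables: `render` is a stand-in for the real pipeline call, and for CUDA work you would synchronise before reading the clock.

```python
import statistics
import time

def benchmark(render, warmup=5, runs=25):
    """Time render() after warmup iterations; return the median seconds."""
    for _ in range(warmup):          # discard JIT / allocator warmup cost
        render()
    timings = []
    for _ in range(runs):            # the timed batch
        start = time.perf_counter()
        render()
        # For GPU pipelines, call torch.cuda.synchronize() here so the
        # clock reads after the kernels finish, not after kernel launch.
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Stand-in workload; swap for pipe(prompt, num_inference_steps=4, ...)
median_s = benchmark(lambda: time.sleep(0.001))
```

Median rather than mean keeps a single slow outlier (page fault, clock scaling) from skewing the reported figure.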

Latency per step and total

| Steps | FP16 latency | FP16 ms/step | FP8 latency | FP8 ms/step |
|---|---|---|---|---|
| 1 | 0.85 s | 850 | 0.62 s | 620 |
| 2 | 1.45 s | 725 | 1.05 s | 525 |
| 4 (recommended) | 2.60 s | 650 | 1.80 s | 450 |
| 8 | 4.95 s | 619 | 3.40 s | 425 |

The schnell distillation targets 4 steps; quality plateaus past 4 and adding more steps mainly adds variance. Per-step latency drops as the per-image fixed cost (text encode, VAE decode, scheduler init) amortises over more denoising iterations.
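The amortisation claim is easy to check with a back-of-envelope least-squares fit of total latency = fixed cost + steps × per-step cost, using the FP8 column of the table above:

```python
# FP8 totals from the table: (steps, total seconds)
points = [(1, 0.62), (2, 1.05), (4, 1.80), (8, 3.40)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
# Ordinary least-squares slope and intercept
per_step = sum((x - mean_x) * (y - mean_y) for x, y in points) / \
           sum((x - mean_x) ** 2 for x, _ in points)
fixed = mean_y - per_step * mean_x

# per_step ≈ 0.40 s of denoising; fixed ≈ 0.24 s of text encode,
# VAE decode, and scheduler setup that every image pays exactly once.
```

As the step count grows, that ~0.24 s one-off cost is spread over more iterations, which is why ms/step falls from 620 to 425.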

Batched generation throughput

| Batch | Total (FP8, 4 steps) | s/image | VRAM peak | images/min |
|---|---|---|---|---|
| 1 | 1.80 s | 1.80 | 9.2 GB | 33 |
| 2 | 3.20 s | 1.60 | 10.4 GB | 37 |
| 4 | 5.80 s | 1.45 | 13.0 GB | 41 |
| 6 | 8.50 s | 1.42 | 15.8 GB | 42 |
| 8 | 11.40 s | 1.43 | 18.6 GB | 42 |
| 12 | 17.30 s | 1.44 | 22.8 GB (tight) | 42 |

Throughput plateaus around 42 images/minute at batch 6+ — the transformer becomes compute-saturated and there’s no further gain from batching. For an interactive image-studio workload, batch 1 at 1.8s is the right default; for bulk batch generation (thumbnail farms, product catalogues), batch 6 at 1.42s/image is the sweet spot.
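The sweet-spot arithmetic is just batch latency divided two ways; a quick helper (numbers copied from the batch table above):

```python
def throughput(batch_size, total_seconds):
    """Per-image latency and steady-state images/minute for one batch size."""
    s_per_image = total_seconds / batch_size
    return s_per_image, 60.0 / s_per_image

interactive = throughput(1, 1.80)   # ≈ (1.80 s/image, 33 images/min)
bulk = throughput(6, 8.50)          # ≈ (1.42 s/image, 42 images/min)
```

Batch 6 cuts per-image cost by ~21% versus batch 1; past that the table shows the per-image figure flat, so larger batches only buy VRAM pressure.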

VRAM breakdown

| Component | FP16 | FP8 |
|---|---|---|
| Transformer (12B params) | 23.5 GB → CPU offload required | 11.8 GB resident |
| Active transformer block (FP16, sequential offload) | ~12 GB resident peak | n/a |
| T5-XXL text encoder | 9.2 GB on first prompt, then CPU | 9.2 GB → CPU after first encode |
| CLIP-L text encoder | 0.4 GB resident | 0.4 GB |
| VAE (FP16) | 0.3 GB | 0.3 GB |
| Activations (1024px, 4 steps) | 1.8 GB | 1.4 GB |
| Scratch / kernel workspace | 0.6 GB | 0.6 GB |
| Peak GPU VRAM | ~14.5 GB | ~9.2 GB |

FP8 keeps everything resident with ample headroom on a 24 GB card; FP16 needs sequential CPU offload of the T5 encoder, adding ~250 ms first-prompt overhead but no per-step penalty after the first encode. This is what the VRAM bandwidth analysis predicts.

FP16 vs FP8 quality

Across the 200-prompt evaluation set, scored against the FP16 reference:

| Metric | FP16 reference | FP8 build | Delta |
|---|---|---|---|
| CLIP-T (prompt adherence) | 31.6 | 31.4 | −0.2 (within seed variance) |
| LPIPS vs FP16 (perceptual diff) | — | 0.058 | imperceptible |
| FID-1k (on COCO captions) | 26.4 | 26.7 | +0.3 |
| Hand / typography failures | 9% of prompts | 10% of prompts | negligible |

Differences sit well inside seed-to-seed variance for a single prompt. For production work where 0.8 s per image matters across thousands of images per day, FP8 is the right default. The 4090's Ada tensor cores execute FP8 natively, so the quantisation pays no emulation penalty: every FP8 operation runs in silicon.

vs SDXL, SDXL-Turbo, FLUX.1-dev

| Model | Steps | 4090 latency | VRAM | Prompt adherence | Licence |
|---|---|---|---|---|---|
| FLUX.1-schnell FP8 | 4 | 1.8 s | 9.2 GB | 87.6 | Apache 2.0 |
| FLUX.1-schnell FP16 | 4 | 2.6 s | 14.5 GB | 87.8 | Apache 2.0 |
| FLUX.1-dev FP8 | 30 | 4.0 s | 14.0 GB | 89.4 | FLUX.1 Non-Commercial |
| FLUX.1-dev FP16 | 30 | 6.2 s | 22.0 GB (tight) | 89.6 | FLUX.1 Non-Commercial |
| SDXL FP16 | 30 | 2.0 s | 9.0 GB | 78.2 | OpenRAIL++ |
| SDXL-Turbo | 4 | 0.55 s | 9.0 GB | 72.0 | SAI Community |
| SDXL-Lightning | 4 | 0.70 s | 9.0 GB | 74.5 | SAI Community |

FLUX.1-schnell occupies a clear sweet spot: better prompt adherence than SDXL/Turbo and meaningful licence freedom (Apache 2.0 vs FLUX.1-dev's Non-Commercial), at a latency only ~1.1 s behind SDXL-Lightning (1.8 s vs 0.70 s). For commercial workloads, schnell is the right default. See the full FLUX.1-dev benchmark for when stepping up to dev makes sense, and the SDXL benchmark for the older workflow.

Production deployment

Standard Diffusers launch with FP8 quantisation:

import torch
from diffusers import FluxPipeline
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
).to("cuda")

# FP8 quantise the transformer (12B params, the bottleneck)
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())

# Keep VAE in FP16 to avoid colour artefacts
pipe.vae.to(torch.float16)

image = pipe(
    prompt="a fox in a victorian library, oil painting",
    num_inference_steps=4,
    guidance_scale=0.0,  # schnell distillation has CFG baked in
    height=1024, width=1024,
).images[0]
image.save("fox.png")

For high-throughput deployments, wrap this in a FastAPI server with a request queue and pre-encode commonly-repeated prompts. With prompt encoder caching and batch=4 the server sustains ~40 images/min single-card.
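Prompt-encoder caching amounts to memoising text-encoder output keyed on the prompt string. A minimal sketch — `encode` here is a stand-in for the pipeline's text-encode step (e.g. Diffusers' `pipe.encode_prompt`, whose exact signature varies by version), and the cache assumes prompts repeat often enough to be worth the memory:

```python
class PromptCache:
    """Memoise text-encoder output so repeat prompts skip the slow T5 pass."""

    def __init__(self, encode, max_entries=1024):
        self.encode = encode            # callable: prompt str -> embeddings
        self.max_entries = max_entries
        self._cache = {}

    def __call__(self, prompt):
        if prompt not in self._cache:
            if len(self._cache) >= self.max_entries:
                # Simple eviction: drop the oldest entry (dicts preserve
                # insertion order); swap in a real LRU if hit rates matter.
                self._cache.pop(next(iter(self._cache)))
            self._cache[prompt] = self.encode(prompt)
        return self._cache[prompt]

# Usage with a stand-in encoder; wire in the real encode step in production.
calls = []
cached = PromptCache(lambda p: calls.append(p) or f"emb({p})")
cached("a fox in a victorian library")
cached("a fox in a victorian library")   # cache hit: encoder not called again
```

Embeddings are small relative to the 9 s first-prompt encode cost, so even a few hundred cached entries pay for themselves on a thumbnail-farm workload.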

Gotchas

  • Guidance scale 0.0 for schnell. The distillation has CFG baked in — passing guidance_scale > 1 produces over-saturated outputs.
  • Don’t FP8 the VAE. The VAE is small and FP8 quant of its weights causes colour banding. Keep it FP16.
  • T5 encode is slow on first prompt. ~9s for the first prompt warm-up; cache the text embeddings for repeat prompts.
  • Sequential CPU offload only for FP16. Don't enable it for FP8 — the FP8 build already fits comfortably in 24 GB, so offload just adds ~250 ms of latency for nothing.
  • torch.compile gives ~10% but takes 90s to JIT. Use only for long-running services, not interactive notebooks.
  • Apache 2.0 licence. schnell is commercially usable; FLUX.1-dev is Non-Commercial. Check the model card before shipping.
  • Compare to 5060 Ti FLUX schnell if you don’t need the 4090’s headroom — Blackwell 16GB runs schnell at 2.4s FP8.

FLUX.1-schnell at 1.8s per Image

Apache 2.0 licensed, FP8 quantised, single-card serving. UK dedicated hosting.

Order the RTX 4090 24GB

See also: FLUX.1-dev benchmark, SDXL benchmark, FLUX setup guide, ComfyUI setup, SD setup, image studio use case, spec breakdown.
