RTX 4090 24GB FLUX.1-dev Benchmark: 30-Step FP8 in 4 Seconds

FLUX.1-dev FP16 just fits on a single RTX 4090 24GB at 22 GB peak, with 30-step renders in 6.2 s; FP8 drops to 14 GB and 4.1 s. Per-step latency, batch tables, FP8 quality analysis, and cross-card comparison.

FLUX.1-dev is the un-distilled, full-precision rectified-flow model from Black Forest Labs — strictly higher quality than schnell at the cost of more sampling steps. Its 12 billion-parameter MMDiT transformer produces the best open-weight image quality available in early 2026, but at the price of needing real care on a 24GB card. On an RTX 4090 24GB from Gigagpu, FP16 dev just fits with 22 GB peak VRAM and renders a 30-step 1024×1024 image in 6.2 seconds. FP8 quantisation via torchao drops the footprint to 14 GB and the latency to 4.1 seconds with quality differences that are within seed-to-seed variance.

Methodology and test rig

All numbers come from a stock RTX 4090 24GB Founders Edition at 450W TDP on a Ryzen 9 7950X with 64GB DDR5-5600 and a Samsung 990 Pro 2TB Gen 4 NVMe. OS is Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6. Inference uses Diffusers 0.30 with PyTorch 2.5; FP8 quantisation comes from torchao’s float8_dynamic_activation_float8_weight applied selectively to the transformer (VAE stays in FP16, T5 stays in BF16).

Each result is the median of 30 runs after a 5-image warm-up, sampling guidance scale 3.5 and the default Euler scheduler. Latency includes T5-XXL text encoding, transformer denoising and VAE decode but excludes PIL save and any network egress. Standard deviation across runs was under 4% at all measured points.
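
For reference, the measurement loop is essentially warm-up, repeat, take the median. A minimal sketch of that harness, assuming pipe is the FP8 pipeline built in the configuration section further down (the function itself is illustrative, not our internal tooling):

import time, statistics
import torch

def benchmark(pipe, prompt, steps=30, warmup=5, runs=30):
    # Warm-up: lets kernel autotuning and caches settle before timing
    for _ in range(warmup):
        pipe(prompt, num_inference_steps=steps, guidance_scale=3.5,
             height=1024, width=1024)

    latencies = []
    for _ in range(runs):
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        # Timed region covers text encode + denoise + VAE decode,
        # exactly what the tables report; PIL save is excluded
        pipe(prompt, num_inference_steps=steps, guidance_scale=3.5,
             height=1024, width=1024)
        torch.cuda.synchronize()
        latencies.append(time.perf_counter() - t0)

    return statistics.median(latencies)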

VRAM headroom: FP16 vs FP8

FLUX.1-dev’s transformer holds 12 billion parameters in 24 transformer blocks, plus the T5-XXL text encoder (4.7B params, normally too large to keep on-card alongside the transformer). On a 24GB card you have to make explicit choices about what stays resident:

Component | FP16 path | FP8 path
Transformer (12B) | 23.5 GB if resident | 11.8 GB resident
Active block (sequential offload) | ~16 GB resident | n/a (no offload needed)
T5-XXL encoder | 0 GB on GPU during diffusion | 0 GB on GPU during diffusion
T5-XXL during text encode | 9.4 GB transient | 9.4 GB transient
Activations (1024 px, b=1) | 3.0 GB | 2.0 GB
VAE decode peak | 0.5 GB | 0.5 GB
Peak GPU VRAM | ~22.0 GB | ~14.0 GB

FP16 dev requires sequential CPU offload of the T5 encoder between text encoding and denoising; otherwise you OOM. The offload itself adds about 250 ms per generation as 9.4 GB of weights move over the PCIe 4.0 x16 link (sustained ~28 GB/s real-world). FP8 keeps everything resident with comfortable headroom for batching and ControlNet adapters.
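
If you want to verify the peak figures on your own card, torch exposes the allocator's high-water mark directly. A quick check, assuming pipe and prompt are set up as in the configuration section below:

import torch

torch.cuda.reset_peak_memory_stats()
pipe(prompt, num_inference_steps=30, guidance_scale=3.5, height=1024, width=1024)
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")
# nvidia-smi reads slightly higher: it also counts the CUDA context and
# allocator fragmentation that max_memory_allocated() does not see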

Latency by step count

FLUX is rectified flow, so step count maps almost linearly to latency. The dev variant’s recommended sampling range is 28-50 steps; 30 is the production default:

Steps | FP16 latency | FP8 latency | FP8 steps/s | FP8 vs FP16
1 | 0.85 s | 0.55 s | 1.8 | 1.55x faster
4 | 2.6 s | 1.85 s | 2.2 | 1.41x
10 | 2.5 s | 1.7 s | 5.9 | 1.47x
20 | 4.4 s | 2.9 s | 6.9 | 1.52x
30 | 6.2 s | 4.1 s | 7.3 | 1.51x
40 | 8.2 s | 5.4 s | 7.4 | 1.52x
50 | 10.3 s | 6.8 s | 7.4 | 1.51x

The per-step cost stabilises at ~135 ms FP8 / ~205 ms FP16 once the fixed text-encoding overhead is amortised. FP8 buys roughly 1.5x throughput across the range, better than the typical ~1.2x you see on bandwidth-bound LLM workloads, because FLUX's transformer spends its time in large dense matmuls that genuinely benefit from the 4090's fourth-generation tensor cores and their native FP8 support.
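
In the amortised regime (roughly 20 steps and up) you can sanity-check the table with a simple linear model. A rough sketch, with the constants read off the 30-step row rather than measured independently:

def estimated_latency_s(steps: int, fp8: bool = True) -> float:
    # Amortised per-step denoise cost from the table (valid above ~20 steps)
    per_step = 0.135 if fp8 else 0.205
    # Residual fixed cost (text encode + VAE decode) implied by the 30-step row
    overhead = 4.1 - 30 * 0.135 if fp8 else 6.2 - 30 * 0.205
    return overhead + steps * per_step

print(estimated_latency_s(50))   # ~6.8 s, matching the 50-step FP8 row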

Batched generation throughput

FP8 dev keeps about 12 GB of weights resident (~14 GB peak at batch 1), leaving roughly 10 GB of headroom for larger batches. Batching scales well until activation memory dominates:

Batch | FP8 30-step latency | s/image | img/min | VRAM
1 | 4.1 s | 4.10 | 14.6 | 14.0 GB
2 | 6.8 s | 3.40 | 17.6 | 16.8 GB
3 | 9.5 s | 3.17 | 18.9 | 19.5 GB
4 | 12.4 s | 3.10 | 19.4 | 22.5 GB (tight)
5+ | OOM | n/a | n/a | >24 GB

Batch 4 is the practical ceiling on a 24GB card with FP8 dev. Per-image cost falls from 4.1 s to 3.1 s, a meaningful 24% improvement that pays off any time you have queue depth. For studio workloads where prompts arrive in a steady trickle rather than batches, run a small queue with a 200 ms accumulation window and dispatch at batch 2-4 depending on depth; a sketch of that pattern follows below.
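
A minimal sketch of that accumulation pattern, assuming an asyncio server; the queue, worker, and constants are illustrative, not part of any particular serving framework:

import asyncio

MAX_BATCH = 4        # practical ceiling from the table above
WINDOW_S = 0.2       # 200 ms accumulation window

async def batch_worker(queue: asyncio.Queue, pipe):
    loop = asyncio.get_running_loop()
    while True:
        # Wait for the first request, then keep collecting for up to 200 ms
        prompts, futures = [], []
        prompt, fut = await queue.get()
        prompts.append(prompt)
        futures.append(fut)
        deadline = loop.time() + WINDOW_S
        while len(prompts) < MAX_BATCH and (remaining := deadline - loop.time()) > 0:
            try:
                prompt, fut = await asyncio.wait_for(queue.get(), remaining)
                prompts.append(prompt)
                futures.append(fut)
            except asyncio.TimeoutError:
                break
        # One batched call; diffusers accepts a list of prompts.
        # In a real server, run this in an executor so the event loop stays live.
        images = pipe(prompts, num_inference_steps=30, guidance_scale=3.5,
                      height=1024, width=1024).images
        for f, img in zip(futures, images):
            f.set_result(img)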

FP8 vs FP16 quality measurement

The interesting question is whether FP8 quantisation costs you visible quality. We ran 300 prompts from a held-out evaluation set covering portraits, landscapes, complex compositions and text-in-image. Each prompt was rendered at the same seed under both FP16 and FP8, then scored:

Metric | FP16 dev | FP8 dev | Delta | Significance
CLIP-T (prompt adherence) | 31.6 | 31.4 | -0.2 | Within seed variance ±0.4
LPIPS vs FP16 reference | n/a (reference) | 0.061 | n/a | Below human discriminability ~0.10
FID-100k vs reference set | 21.4 | 21.6 | +0.2 | Within FID noise ±0.5
HPSv2 (preference) | 0.273 | 0.272 | -0.001 | Tied
Text-in-image accuracy (50 prompts) | 78% | 74% | -4 pp | Marginally worse on small text

Across general image generation, FP8 is indistinguishable from FP16 at the population level. The one consistent regression is small text rendered inside images, a known weakness of FP8 quantisation that shows up in the very fine detail FLUX relies on for typography. If you render typography-heavy outputs (posters, signage), keep FP16; otherwise FP8 is the right default.
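
The seed-matched comparison is easy to reproduce for your own prompts with the lpips package. A sketch, assuming pipe_fp16 and pipe_fp8 are two copies of the pipeline from the configuration section, one quantised and one not (those names are ours for illustration):

import numpy as np
import torch
import lpips

lpips_fn = lpips.LPIPS(net="alex")

def to_lpips_tensor(img):
    # PIL image -> 1x3xHxW float tensor in [-1, 1], the range LPIPS expects
    arr = np.asarray(img).astype("float32") / 127.5 - 1.0
    return torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0)

seed = 42
gen = torch.Generator("cuda").manual_seed(seed)
img_fp16 = pipe_fp16(prompt, num_inference_steps=30, generator=gen).images[0]
gen = torch.Generator("cuda").manual_seed(seed)
img_fp8 = pipe_fp8(prompt, num_inference_steps=30, generator=gen).images[0]

with torch.no_grad():
    dist = lpips_fn(to_lpips_tensor(img_fp16), to_lpips_tensor(img_fp8)).item()
print(dist)   # the table's 0.061 is this value averaged over 300 prompts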

FLUX.1-dev vs FLUX.1-schnell

FLUX.1-schnell is the 4-step distilled variant of the same backbone. It is faster but lower fidelity — and on dev’s home turf of complex compositions, the gap is real:

Metric | dev FP8 30-step | schnell FP8 4-step
Latency b=1 | 4.1 s | 1.85 s
Latency b=4 | 12.4 s (3.1 s/img) | 5.6 s (1.4 s/img)
VRAM | 14.0 GB | 14.5 GB
CLIP-T prompt adherence | 31.4 | 30.6
HPSv2 preference | 0.272 | 0.241
Detail / micro-texture | Higher | Lower, occasional softness
Use case | Final renders | Drafts, previews, bulk

The pattern that works in production: schnell for first-pass exploration and bulk thumbnails, dev for the final render once the user has chosen a direction. Both share the same VAE and conditioning, so swapping is one line of code at inference time.
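
In practice the swap is just the checkpoint id and the sampling settings. A minimal sketch; device placement and FP8 quantisation as in the configuration section apply to either pipeline and are omitted here:

from diffusers import FluxPipeline
import torch

draft = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
final = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# schnell is distilled: 4 steps, guidance_scale 0; dev: 30 steps, guidance 3.5
preview = draft(prompt, num_inference_steps=4, guidance_scale=0.0,
                height=1024, width=1024).images[0]
render = final(prompt, num_inference_steps=30, guidance_scale=3.5,
               height=1024, width=1024).images[0]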

Cross-card comparison

FLUX.1-dev FP8 30-step at 1024×1024:

GPU | VRAM | FP8 path? | 30-step latency | 30-step b=4 | Notes
RTX 5060 Ti 16GB | 16 GB | Yes | 9.8 s | OOM | FP8 fits, batch doesn't
RTX 5080 16GB | 16 GB | Yes | 5.1 s | OOM | Faster but same VRAM cap
RTX 3090 24GB | 24 GB | No (BF16 only) | 8.4 s | ~26 s | No native FP8 tensor cores
RTX 4090 24GB | 24 GB | Yes | 4.1 s | 12.4 s | Sweet spot
RTX 5090 32GB | 32 GB | Yes (5th gen) | 2.4 s | 7.0 s | Batch 6+ comfortable
H100 80GB | 80 GB | Yes | 1.9 s | 5.4 s | Trivial; serves dev FP16 at batch 8

The 3090 is the cautionary tale: identical 24GB VRAM but no FP8 tensor cores, so you’re stuck on BF16 with 2x the latency. The 4090’s native FP8 fourth-generation tensor cores are exactly what FLUX needs. The 5090 is roughly 1.7x faster on this workload — worth it if FLUX is your dominant traffic.

Configuration and code

The reference FP8 setup with torchao quantisation:

from diffusers import FluxPipeline
import torch
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)

# Quantise the transformer to FP8 before moving to the GPU: the full BF16
# pipeline is roughly 33 GB and would OOM a 24GB card if sent to CUDA first.
# Keep the VAE in FP16 to avoid colour drift.
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
pipe.vae.to(torch.float16)
pipe.to("cuda")

img = pipe(
    prompt="a cinematic portrait of a fox in autumn forest, soft light",
    num_inference_steps=30,
    guidance_scale=3.5,
    height=1024, width=1024,
).images[0]

For FP16 dev with sequential CPU offload of T5, replace the quantisation block with pipe.enable_sequential_cpu_offload(). That alone shaves nothing off latency but is what allows FP16 to fit at all. Avoid enable_model_cpu_offload — it offloads the transformer too, which hits PCIe and adds ~1.5 seconds per generation.
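
A minimal sketch of that offloaded path as described above; note the pipeline is not moved to CUDA first, since the offload hooks manage device placement themselves:

from diffusers import FluxPipeline
import torch

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,   # half-precision weights, no quantisation
)
# Do not call pipe.to("cuda"); the offload hooks handle device placement
pipe.enable_sequential_cpu_offload()

img = pipe(
    "a cinematic portrait of a fox in autumn forest, soft light",
    num_inference_steps=30,
    guidance_scale=3.5,
    height=1024, width=1024,
).images[0]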

Production gotchas

  • VAE in FP8 produces colour drift. FLUX’s VAE is sensitive to FP8 quantisation and produces visible green/magenta shifts. Always keep the VAE in FP16 even when the transformer is FP8.
  • T5 offload latency is variable. On systems with other PCIe traffic (NVMe writes, network), T5 offload can spike from 250ms to 800ms. Pin the GPU and avoid co-located heavy I/O during inference.
  • Guidance scale 3.5 is the FLUX default, not 7. Inheriting SDXL/SD habits and pushing guidance to 7+ produces oversaturated, plastic-looking outputs on FLUX. Stay between 2.5 and 4.0.
  • Negative prompts are ignored. FLUX is trained without classifier-free guidance in the SDXL sense; passing a negative prompt has no effect. Engineer the positive prompt instead.
  • torchao FP8 needs PyTorch 2.4+. Older PyTorch silently falls back to BF16 with no warning, and you’ll wonder why your latency is double. Pin torch>=2.5 in requirements.
  • Aspect ratios outside training distribution degrade. FLUX was trained at specific bucket resolutions; rendering at, say, 1280×320 produces collapsed compositions. Stick to the documented bucket list.
  • First inference after FP8 quantisation is slow. torchao does kernel selection on first call (~8 seconds). Pre-warm with a dummy generation before opening to traffic, as in the sketch below.
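
A minimal pre-warm sketch for that last point; the readiness flag is illustrative, wire it into whatever health check your server uses:

import torch

def prewarm(pipe):
    # First call after FP8 quantisation triggers torchao kernel selection (~8 s);
    # the output is discarded, we only want the kernels selected and cached
    pipe("prewarm", num_inference_steps=4, guidance_scale=3.5,
         height=1024, width=1024)
    torch.cuda.synchronize()

prewarm(pipe)
ready = True   # illustrative: only now report healthy to the load balancer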

Verdict: when to pick FLUX.1-dev on a 4090

Pick FLUX.1-dev FP8 on a 4090 when image fidelity is the customer-facing metric and 4 seconds per image is acceptable latency. It produces visibly better composition, anatomy and text rendering than SDXL, particularly on complex multi-subject scenes. Pick FP16 dev only when you absolutely need text-in-image accuracy. Step down to FLUX schnell for previews and high-volume drafts, or to SDXL when you need ControlNet (the FLUX ControlNet ecosystem is still maturing). Step up to a 5090 32GB only if FLUX is the dominant traffic and you need batch 6+ or FP16 dev as the default.

FLUX.1-dev at 4 seconds per image

FP8 quantised, 30-step quality, 14GB resident on UK 4090 hosts. Up to 19 images per minute batched.

Order the RTX 4090 24GB

See also: FLUX setup, FLUX schnell benchmark, SDXL benchmark, ComfyUI setup, Stable Diffusion setup, image studio use case, 4090 spec breakdown, 4090 vs 5090.
