FLUX.1-schnell from Black Forest Labs is the Apache 2.0-licensed, 4-step distilled variant of the FLUX.1 family — the de facto standard for fast, licence-clean image generation in 2026. On a single RTX 4090 24GB through Gigagpu’s UK dedicated hosting, FP8-quantised schnell renders a 1024×1024 image in 1.8 seconds at 4 steps. FP16 lands at 2.6 seconds. This page contains the full benchmark suite — per-step latency, batch throughput, VRAM headroom, FP8 vs FP16 quality differential, and the production launch config.
Contents
- Test rig and methodology
- Latency per step and total
- Batched generation throughput
- VRAM breakdown
- FP16 vs FP8 quality
- vs SDXL, SDXL-Turbo, FLUX.1-dev
- Production deployment
- Gotchas
Test rig and methodology
- GPU: RTX 4090 24GB Founders Edition, 450W stock TDP, no undervolt
- Host: Ryzen 9 7950X, 64 GB DDR5-5600, Samsung 990 Pro 2 TB Gen 4 NVMe
- OS: Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6
- Stack: Diffusers 0.30, PyTorch 2.5, FlashAttention 2.6, torchao 0.5 (for FP8 transformer quant), xFormers 0.0.28
- Resolution: 1024×1024 unless noted; 16-channel VAE in FP16; T5-XXL text encoder offloaded after first prompt
- Prompts: 200-prompt set covering portrait, landscape, abstract, typography
- Each measurement: 5-image warmup, then 25-image timed batch, median reported
Benchmarks run with the standard FluxPipeline; FP8 weights produced via torchao’s float8 dynamic activation/weight quantisation on the transformer (T5 stays bf16, VAE stays FP16 to avoid colour artefacts).
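For reference, the timing loop looks roughly like the sketch below — a minimal harness, assuming `pipe` is the FluxPipeline configured as in the deployment section; the function name and prompt handling are illustrative:

```python
import statistics
import time

import torch

def bench(pipe, prompts, steps=4, warmup=5, timed=25):
    # Warm-up: first T5 encode, CUDA context init, kernel autotuning
    for p in prompts[:warmup]:
        pipe(prompt=p, num_inference_steps=steps, guidance_scale=0.0)
    times = []
    for p in prompts[warmup:warmup + timed]:
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        pipe(prompt=p, num_inference_steps=steps, guidance_scale=0.0)
        torch.cuda.synchronize()        # wait for the GPU before stopping the clock
        times.append(time.perf_counter() - t0)
    return statistics.median(times)     # median over the 25 timed images
```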
Latency per step and total
| Steps | FP16 latency | FP16 ms/step | FP8 latency | FP8 ms/step |
|---|---|---|---|---|
| 1 | 0.85 s | 850 | 0.62 s | 620 |
| 2 | 1.45 s | 725 | 1.05 s | 525 |
| 4 (recommended) | 2.60 s | 650 | 1.80 s | 450 |
| 8 | 4.95 s | 619 | 3.40 s | 425 |
The schnell distillation targets 4 steps; quality plateaus past 4 and adding more steps mainly adds variance. Per-step latency drops as the per-image fixed cost (text encode, VAE decode, scheduler init) amortises over more denoising iterations.
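Backing the fixed cost out of the FP8 column makes the amortisation concrete — a back-of-envelope fit to the table above, not a separate measurement:

```python
# Model total latency as fixed + steps * per_step, fitted from the
# 4- and 8-step FP8 rows of the table above.
fp8_per_step = (3.40 - 1.80) / (8 - 4)   # 0.40 s true marginal step cost
fp8_fixed = 1.80 - 4 * fp8_per_step      # ~0.20 s encode/decode/scheduler
# Sanity check: predicts 0.60 s at 1 step vs the measured 0.62 s.
```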
Batched generation throughput
| Batch | Batch total (FP8, 4 steps) | s/image | VRAM peak | images/min |
|---|---|---|---|---|
| 1 | 1.80 s | 1.80 | 9.2 GB | 33 |
| 2 | 3.20 s | 1.60 | 10.4 GB | 37 |
| 4 | 5.80 s | 1.45 | 13.0 GB | 41 |
| 6 | 8.50 s | 1.42 | 15.8 GB | 42 |
| 8 | 11.40 s | 1.43 | 18.6 GB | 42 |
| 12 | 17.30 s | 1.44 | 22.8 GB (tight) | 42 |
Throughput plateaus around 42 images/minute at batch 6+ — the transformer becomes compute-saturated and there’s no further gain from batching. For an interactive image-studio workload, batch 1 at 1.8s is the right default; for bulk batch generation (thumbnail farms, product catalogues), batch 6 at 1.42s/image is the sweet spot.
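For the bulk path, batching is just a list of prompts to the same pipeline; a minimal sketch at the batch-6 sweet spot (the prompt is illustrative):

```python
# Six prompts in one forward pass: ~8.5 s total, ~1.42 s/image at 4 steps
prompts = ["product photo of a ceramic mug, studio lighting"] * 6
images = pipe(
    prompt=prompts,
    num_inference_steps=4,
    guidance_scale=0.0,
    height=1024, width=1024,
).images
for i, img in enumerate(images):
    img.save(f"mug_{i:02d}.png")
```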
VRAM breakdown
| Component | FP16 | FP8 |
|---|---|---|
| Transformer (12B params) | 23.5 GB → CPU offload required | 11.8 GB resident |
| Active transformer block (FP16, sequential offload) | ~12 GB resident peak | n/a |
| T5-XXL text encoder | 9.2 GB on first prompt, then CPU | 9.2 GB → CPU after first encode |
| CLIP-L text encoder | 0.4 GB resident | 0.4 GB |
| VAE (FP16) | 0.3 GB | 0.3 GB |
| Activations (1024px, 4 steps) | 1.8 GB | 1.4 GB |
| Scratch / kernel workspace | 0.6 GB | 0.6 GB |
| Peak GPU VRAM | ~14.5 GB | ~9.2 GB |
FP8 keeps everything resident with ample headroom on a 24 GB card; FP16 needs sequential CPU offload of the T5 encoder, adding ~250 ms first-prompt overhead but no per-step penalty after the first encode. This is what the VRAM bandwidth analysis predicts.
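For the FP16 path, stock Diffusers can reproduce that offload behaviour. A sketch — the exact offload policy behind the table may differ, and `enable_sequential_cpu_offload()` is the tighter-VRAM but slower alternative:

```python
# bf16/FP16 on a 24 GB card: let Diffusers move idle sub-models to CPU.
# enable_model_cpu_offload() shuttles whole components (T5-XXL, transformer,
# VAE) to system RAM when not in use; do not also call .to("cuda").
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
```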
FP16 vs FP8 quality
Across the 200-prompt evaluation set, scored against the FP16 reference:
| Metric | FP16 reference | FP8 build | Delta |
|---|---|---|---|
| CLIP-T (prompt adherence) | 31.6 | 31.4 | −0.2 (within seed variance) |
| LPIPS-vs-FP16 (perceptual diff) | — | 0.058 | imperceptible |
| FID-1k (on COCO captions) | 26.4 | 26.7 | +0.3 |
| Hand / typography failures | 9% of prompts | 10% of prompts | +1 pt (negligible) |
Differences sit well inside seed-to-seed variance for a single prompt. For production work, where the 0.8s saved per image compounds across thousands of images a day, FP8 is the right default. The 4090’s tensor cores execute FP8 natively, so there is no emulation penalty: every FP8 operation runs in silicon.
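The LPIPS figure is straightforward to reproduce on your own prompts. A sketch assuming the `lpips` package and two same-seed outputs — `img_fp16` and `img_fp8` are placeholder names:

```python
import lpips   # pip install lpips
import numpy as np
import torch

loss_fn = lpips.LPIPS(net="alex")

def to_lpips_tensor(img):
    # PIL image -> 1x3xHxW float tensor scaled to [-1, 1], as LPIPS expects
    x = torch.from_numpy(np.asarray(img)).permute(2, 0, 1).float()
    return (x / 127.5 - 1.0).unsqueeze(0)

# Generate both images with the same generator seed before comparing
d = loss_fn(to_lpips_tensor(img_fp16), to_lpips_tensor(img_fp8)).item()
print(f"LPIPS: {d:.3f}")   # averages ~0.058 over the 200-prompt set
```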
vs SDXL, SDXL-Turbo, FLUX.1-dev
| Model | Steps | 4090 latency | VRAM | Prompt adherence | Licence |
|---|---|---|---|---|---|
| FLUX.1-schnell FP8 | 4 | 1.8 s | 9.2 GB | 87.6 | Apache 2.0 |
| FLUX.1-schnell FP16 | 4 | 2.6 s | 14.5 GB | 87.8 | Apache 2.0 |
| FLUX.1-dev FP8 | 30 | 4.0 s | 14.0 GB | 89.4 | FLUX.1 Non-Commercial |
| FLUX.1-dev FP16 | 30 | 6.2 s | 22.0 GB (tight) | 89.6 | FLUX.1 Non-Commercial |
| SDXL FP16 | 30 | 2.0 s | 9.0 GB | 78.2 | OpenRAIL++ |
| SDXL-Turbo | 4 | 0.55 s | 9.0 GB | 72.0 | SAI Community |
| SDXL-Lightning | 4 | 0.70 s | 9.0 GB | 74.5 | SAI Community |
FLUX.1-schnell occupies a clear sweet spot: better prompt adherence than SDXL/Turbo and meaningful licence freedom (Apache 2.0 vs FLUX.1-dev’s Non-Commercial), at a latency only 1.1s behind SDXL-Lightning. For commercial workloads, schnell is the right default. See the full FLUX.1-dev benchmark for when stepping up to dev makes sense, and the SDXL benchmark for the older workflow.
Production deployment
Standard Diffusers launch with FP8 quantisation:
```python
import torch
from diffusers import FluxPipeline
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
).to("cuda")

# FP8-quantise the transformer (12B params, the bottleneck)
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())

# Keep the VAE in FP16 to avoid colour artefacts
pipe.vae.to(torch.float16)

image = pipe(
    prompt="a fox in a victorian library, oil painting",
    num_inference_steps=4,
    guidance_scale=0.0,          # schnell distillation has CFG baked in
    height=1024, width=1024,
    max_sequence_length=256,     # schnell was trained with a 256-token T5 context
).images[0]
image.save("fox.png")
```
For high-throughput deployments, wrap this in a FastAPI server with a request queue and pre-encode commonly-repeated prompts. With prompt encoder caching and batch=4 the server sustains ~40 images/min single-card.
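A sketch of the embedding cache — function names are illustrative; it assumes `FluxPipeline.encode_prompt`, which returns the T5 embeddings, the pooled CLIP embeddings, and text IDs:

```python
from functools import lru_cache

import torch

@lru_cache(maxsize=1024)
def encode(prompt: str):
    # One T5-XXL + CLIP pass per distinct prompt, then served from cache
    with torch.no_grad():
        prompt_embeds, pooled_embeds, _ = pipe.encode_prompt(
            prompt=prompt, prompt_2=prompt, max_sequence_length=256
        )
    return prompt_embeds, pooled_embeds

def generate(prompt: str):
    prompt_embeds, pooled_embeds = encode(prompt)
    return pipe(
        prompt_embeds=prompt_embeds,
        pooled_prompt_embeds=pooled_embeds,
        num_inference_steps=4,
        guidance_scale=0.0,
        height=1024, width=1024,
    ).images[0]
```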
Gotchas
- Guidance scale 0.0 for schnell. The distillation has CFG baked in — passing guidance_scale > 1 produces over-saturated outputs.
- Don’t FP8 the VAE. The VAE is small and FP8 quant of its weights causes colour banding. Keep it FP16.
- T5 encode is slow on first prompt. ~9s for the first prompt warm-up; cache the text embeddings for repeat prompts.
- Sequential CPU offload only for FP16. Don’t enable it for FP8 — adds 250ms with no VRAM benefit.
- torch.compile gives ~10% but takes 90s to JIT. Use it only for long-running services, not interactive notebooks; see the sketch after this list.
- Apache 2.0 licence. schnell is commercially usable; FLUX.1-dev is Non-Commercial. Check the model card before shipping.
- Compare to the 5060 Ti FLUX schnell benchmark if you don’t need the 4090’s headroom — the Blackwell 16GB card runs schnell at 2.4s FP8.
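For the torch.compile point above, a minimal sketch; the compile mode is a starting point, not a tested config:

```python
# ~10% faster steady-state; the first call pays the ~90 s compile
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune")
```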
FLUX.1-schnell at 1.8s per Image
Apache 2.0 licensed, FP8 quantised, single-card serving. UK dedicated hosting.
Order the RTX 4090 24GB
See also: FLUX.1-dev benchmark, SDXL benchmark, FLUX setup guide, ComfyUI setup, SD setup, image studio use case, spec breakdown.