FLUX.1-schnell from Black Forest Labs is the 4-step distilled variant of FLUX.1 – state-of-the-art image quality at fast iteration times. On the RTX 5060 Ti 16GB via our hosting, FLUX.1-schnell fits with headroom.
Setup
- Diffusers 0.30, PyTorch 2.5
- Model: black-forest-labs/FLUX.1-schnell
- 12B-param diffusion transformer (DiT), T5 + CLIP text encoders
- Resolution: 1024×1024
- Licence: Apache 2.0 (schnell variant)
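A minimal sketch of this setup with Diffusers (assuming diffusers ≥ 0.30 and a CUDA device; the `generate` helper name is ours). `guidance_scale=0.0` because the schnell distillation does not use classifier-free guidance:

```python
# Sketch: one 1024x1024 image from FLUX.1-schnell via Diffusers.
# Assumes diffusers >= 0.30, PyTorch 2.5 and a CUDA GPU, per the setup above.
def generate(prompt: str, steps: int = 4, seed: int = 0):
    import torch
    from diffusers import FluxPipeline

    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
    )
    pipe.to("cuda")
    generator = torch.Generator("cuda").manual_seed(seed)
    return pipe(
        prompt,
        num_inference_steps=steps,  # schnell is distilled for 1-4 steps
        guidance_scale=0.0,         # schnell ignores CFG
        height=1024,
        width=1024,
        generator=generator,
    ).images[0]
```

Swap `steps=4` for `steps=1` to trade quality for the fastest possible turnaround, as the tables below show.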
FP16 Throughput
| Steps | Time | VRAM peak |
|---|---|---|
| 1 | 1.9 s | 14.8 GB |
| 2 | 2.5 s | 14.8 GB |
| 4 | 3.8 s | 14.8 GB |
FP16 just fits. 1-step produces passable images; 4-step is the recommended setting – under 4 seconds per 1024×1024 image.
FP8 Throughput
| Steps | Time | VRAM peak |
|---|---|---|
| 1 | 1.2 s | 9.2 GB |
| 2 | 1.6 s | 9.2 GB |
| 4 | 2.4 s | 9.2 GB |
FP8 drops peak VRAM from 14.8 GB to 9.2 GB with a ~35% speed uplift on Blackwell, and quality is essentially indistinguishable at 1024×1024. FP8-quantised weights are available from community uploads or can be produced locally with ComfyUI's FP8 nodes.
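For capacity planning, the per-image times above convert directly into per-minute throughput (a small helper; the numbers are taken straight from the two tables):

```python
# Seconds per 1024x1024 image, from the FP16 and FP8 tables above.
fp16_s = {1: 1.9, 2: 2.5, 4: 3.8}
fp8_s = {1: 1.2, 2: 1.6, 4: 2.4}

def per_minute(seconds_per_image: float) -> float:
    """Images per minute at a given per-image latency."""
    return 60.0 / seconds_per_image

for steps in (1, 2, 4):
    print(f"{steps}-step: FP16 {per_minute(fp16_s[steps]):.1f}/min, "
          f"FP8 {per_minute(fp8_s[steps]):.1f}/min")
```

At the recommended 4-step setting, FP8 works out to about 25 images/minute versus roughly 16 for FP16.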
Fit and VRAM
- FP16: 14.8 GB peak – no room for batch
- FP8: 9.2 GB – fits batch 2 comfortably
- CPU-offloaded text encoder (T5): reclaims ~3 GB at the cost of ~500 ms extra prompt-encoding latency per image
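If you want the offload behaviour without hand-placing components, Diffusers ships a generic helper. This is a sketch: `enable_model_cpu_offload()` offloads every component (T5, CLIP, transformer, VAE) on demand, which approximates the T5-only offload measured above at some extra latency cost; the `load_with_cpu_offload` wrapper name is ours:

```python
# Sketch: reclaim VRAM by letting Diffusers shuttle components to/from CPU.
# Each component is moved to the GPU only while it runs, then moved back.
def load_with_cpu_offload():
    import torch
    from diffusers import FluxPipeline

    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()  # no explicit .to("cuda") needed after this
    return pipe
```

Note that calling `pipe.to("cuda")` after enabling offload would defeat the purpose; the helper manages device placement itself.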
vs SDXL
| Metric | FLUX.1-schnell FP8 4-step | SDXL 30-step |
|---|---|---|
| Time/image @ 1024 | 2.4 s | 3.4 s |
| Peak VRAM | 9.2 GB | 9.2 GB |
| Prompt adherence | Noticeably better | Baseline |
| Typography / text in image | Much better | Weak |
FLUX.1-schnell is both faster and higher quality than SDXL for most prompts. Unless you depend on an SDXL-specific LoRA or checkpoint, FLUX is the new default.
FLUX.1 on Blackwell 16GB
2.4 s per 1024px image at FP8. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: SDXL benchmark, SD 1.5 benchmark, ComfyUI setup, image studio, SD setup.