FLUX.1 from Black Forest Labs is the current state of the art in open image generation, and the RTX 4090 24GB dedicated server is the cheapest hardware that keeps the full FP16 dev transformer resident in VRAM. FP8 quantisation halves the footprint, the Ada AD102's native FP8 tensor cores run it at full throughput with no perceptible quality loss, and 1008 GB/s of GDDR6X bandwidth keeps the 12-billion-parameter diffusion transformer fed at every sampling step. This guide covers schnell, dev and FP8-quantised workflows, plus LoRA stacking and ControlNet, so you can pick the right balance of speed, VRAM and quality on a GigaGPU dedicated box.
Contents
- FLUX.1 variants and licensing
- Why 24 GB matters
- Install ComfyUI for FLUX
- FP8 quantised workflow
- Throughput and step counts
- Adding LoRAs and ControlNet
- Production gotchas
FLUX.1 variants and licensing
| Variant | Licence | Sampling steps | FP16 VRAM | FP8 VRAM | Use case |
|---|---|---|---|---|---|
| FLUX.1-schnell | Apache 2.0 | 4 | 22 GB | 11.5 GB | Commercial production, fast turnaround |
| FLUX.1-dev | Non-commercial research | 20-30 | 22 GB | 11.5 GB | Highest open-weight quality |
| FLUX.1-pro | API only (BFL) | 30+ | n/a | n/a | Closed, hosted only |
| FLUX.1-fill | Non-commercial | 20 | 22 GB | 11.5 GB | Inpaint/outpaint |
| FLUX.1-canny | Non-commercial | 20 | 22 GB | 11.5 GB | Edge-conditioned generation |
| FLUX.1-depth | Non-commercial | 20 | 22 GB | 11.5 GB | Depth-conditioned generation |
For commercial work, stick to schnell. The dev variant is limited to research, evaluation and personal projects until BFL releases a commercial licence, which is rumoured but not promised.
Why 24 GB matters
FLUX is a 12-billion-parameter rectified-flow transformer paired with two text encoders (T5-XXL and CLIP-L). At FP16 the diffusion transformer lands at ~22 GB resident, leaving around 2 GB of headroom on a 24 GB Ada card for batch 1, the VAE decode and a single small LoRA (T5-XXL encodes the prompt and is then released, as the table below shows). Lower-VRAM cards must shuttle the T5-XXL encoder to CPU on every prompt change, which costs ~3-4 s per image on a fast desktop and 8+ s on slower hardware. The 4090 keeps the sampling loop fully on-GPU and runs schnell in 1.8 s per image, dev in ~11 s.
| Component | FP16 | FP8 e4m3 | NF4 (GGUF) |
|---|---|---|---|
| Diffusion transformer | 22 GB | 11.5 GB | 6.8 GB |
| T5-XXL encoder | 9.5 GB | 4.8 GB | 4.8 GB |
| CLIP-L | 0.25 GB | 0.25 GB | 0.25 GB |
| VAE | 0.16 GB | 0.16 GB | 0.16 GB |
| 4090 leaves free | ~2 GB (T5 offloaded) | ~7 GB | ~12 GB |
FP8 e4m3 on Ada's native tensor cores is the production sweet spot: half the VRAM, visually near-identical output, and ~30% faster wall clock thanks to higher arithmetic throughput.
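The budget above can be sanity-checked with simple arithmetic. The snippet below is a sketch using the approximate component sizes from the table, not measured values:

```python
# Rough VRAM budget for FLUX.1 on a 24 GB card, using the approximate
# component sizes from the table above (not measured values).
CARD_GB = 24.0

components = {
    "fp16": {"transformer": 22.0, "t5_xxl": 9.5, "clip_l": 0.25, "vae": 0.16},
    "fp8":  {"transformer": 11.5, "t5_xxl": 4.8, "clip_l": 0.25, "vae": 0.16},
}

for precision, sizes in components.items():
    total = sum(sizes.values())
    # At FP16 the total exceeds 24 GB, which is why the T5-XXL encoder
    # is offloaded (or released after prompt encoding) on a 4090.
    fits = total <= CARD_GB
    print(f"{precision}: {total:.2f} GB resident -> "
          f"{'fits' if fits else 'needs T5 offload'}, "
          f"{CARD_GB - total:+.2f} GB headroom")
```

Running it shows why FP8 is comfortable (~7 GB spare with everything loaded) while FP16 cannot keep the transformer and T5-XXL resident at the same time.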
Install ComfyUI for FLUX
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI && python3.11 -m venv venv && source venv/bin/activate
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
# weights
cd models/checkpoints
wget https://huggingface.co/black-forest-labs/FLUX.1-schnell/resolve/main/flux1-schnell.safetensors
cd ../clip
wget https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp16.safetensors
wget https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/clip_l.safetensors
cd ../vae
wget https://huggingface.co/black-forest-labs/FLUX.1-schnell/resolve/main/ae.safetensors
cd ../.. && python main.py --listen 0.0.0.0 --port 8188 --highvram
Load the official FLUX schnell example workflow from ComfyUI/custom_nodes/example_workflows/ and run it with the default 4-step Euler sampler at CFG 1.0. For FP8, swap the FP16 T5 encoder for the FP8 variant via the DualCLIPLoader node and you free ~4.7 GB.
If you want the ComfyUI custom node walkthrough end-to-end (Manager, ControlNet preprocessors, Impact Pack), see the ComfyUI setup guide.
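Once the server is listening on port 8188, you can also queue generations programmatically: ComfyUI accepts an API-format workflow via `POST /prompt`. The sketch below builds and sends that request; the placeholder graph and node IDs are assumptions — export your real FLUX workflow with "Save (API Format)" and load that instead.

```python
import json
import urllib.request

# Minimal sketch: queue an API-format workflow against a local ComfyUI
# server. The graph below is a placeholder, not a working FLUX workflow;
# export yours with "Save (API Format)" from the ComfyUI menu.
COMFY_URL = "http://127.0.0.1:8188/prompt"

def build_payload(workflow: dict) -> bytes:
    """ComfyUI expects {"prompt": <api-format graph>} as the POST body."""
    return json.dumps({"prompt": workflow}).encode("utf-8")

def queue_prompt(workflow: dict) -> bytes:
    """POST the workflow; the JSON response carries a prompt_id."""
    req = urllib.request.Request(
        COMFY_URL,
        data=build_payload(workflow),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Build (but don't send) a payload for a placeholder graph:
payload = build_payload({"1": {"class_type": "KSampler", "inputs": {}}})
print(payload[:40])
```

This is how you would drive the 24/7 batch throughput discussed below without keeping a browser tab open.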
FP8 quantised workflow
FP8 e4m3 weights halve VRAM with negligible quality loss on Ada Lovelace. Replace the FP16 checkpoint with flux1-dev-fp8.safetensors from Kijai/flux-fp8:
cd models/checkpoints
wget https://huggingface.co/Kijai/flux-fp8/resolve/main/flux1-dev-fp8.safetensors
cd ../clip
wget https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp8_e4m3fn.safetensors
Free VRAM now sits around 12 GB, which is enough for two 1024×1024 images in parallel, three stacked LoRAs, or a heavy ControlNet pipeline (FLUX-canny + IPAdapter + face detailer). Wall clock drops from 11 s to 7.5 s per dev image at 20 steps, a 30% speed-up at no perceptible quality cost.
Throughput and step counts
| Workflow | Steps | Resolution | Latency | Images/min | Images/day |
|---|---|---|---|---|---|
| schnell FP16 | 4 | 1024×1024 | 1.8 s | 33 | ~47,000 |
| schnell FP8 | 4 | 1024×1024 | 1.4 s | 42 | ~60,000 |
| dev FP16 | 20 | 1024×1024 | 11 s | 5.5 | ~7,900 |
| dev FP8 | 20 | 1024×1024 | 7.5 s | 8 | ~11,500 |
| dev FP16 | 28 | 1280×768 | 14 s | 4.3 | ~6,200 |
| dev FP8 | 28 | 1280×768 | 9.6 s | 6.3 | ~9,000 |
| dev FP8 | 28 | 1920×1088 | 21 s | 2.9 | ~4,100 |
At 33 schnell images per minute, a single 4090 pumps out ~47,000 images in a 24-hour cycle, which is enormous for a single-GPU box. See the FLUX schnell benchmark for the full step/resolution/sampler sweep, and the FLUX dev benchmark for the dev variant.
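The per-minute and per-day figures in the table follow from latency by simple division. This sketch reproduces them; the table's daily totals run slightly lower than the raw arithmetic, presumably allowing for some duty-cycle loss:

```python
# Derive images/min and images/day from per-image latency. The table's
# daily figures sit a little below these raw numbers, which suggests a
# small allowance for queueing/duty-cycle overhead.
def throughput(latency_s: float) -> tuple[float, int]:
    per_min = round(60.0 / latency_s, 1)
    per_day = int(86_400 / latency_s)
    return per_min, per_day

for name, lat in [("schnell FP16", 1.8), ("schnell FP8", 1.4),
                  ("dev FP16", 11.0), ("dev FP8", 7.5)]:
    per_min, per_day = throughput(lat)
    print(f"{name}: {per_min} img/min, ~{per_day:,} img/day")
```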
Adding LoRAs and ControlNet
FLUX LoRAs are small (50-300 MB) and load at runtime through Comfy’s LoRA Loader node. With FP8 dev you can stack three LoRAs and still leave 4 GB free, which is enough for IPAdapter conditioning. With FP16 dev you can manage one LoRA at a time without spilling.
| Add-on | FP16 dev VRAM | FP8 dev VRAM | Latency penalty |
|---|---|---|---|
| +1 LoRA | +0.3 GB | +0.3 GB | +0.1 s |
| +IPAdapter | +1.4 GB | +1.4 GB | +0.4 s |
| +ControlNet (FLUX-canny) | +11 GB | +5.8 GB | +1.8 s |
| +Face detailer (Impact Pack) | +0.6 GB | +0.6 GB | +1.5 s |
For ControlNet specifically, FP8 is essentially mandatory on the 4090: FP16 dev plus FLUX-canny exceeds 24 GB and you start spilling to CPU.
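Whether a given stack fits is just addition against your free headroom. The sketch below uses the add-on costs from the table and the headroom figures quoted earlier (~2 GB free on FP16 dev, ~12 GB on FP8 dev after prompt encoding); it is rough bookkeeping, not a profiler:

```python
# Rough stacking check: add-on VRAM costs from the table above, summed
# against the free-headroom figures quoted in the text (approximate).
ADDONS_GB = {
    "lora": 0.3,
    "ipadapter": 1.4,
    "controlnet_fp16": 11.0,
    "controlnet_fp8": 5.8,
    "face_detailer": 0.6,
}

def fits(headroom_gb: float, *addons: str) -> bool:
    """True if the summed add-on cost stays within free headroom."""
    return sum(ADDONS_GB[a] for a in addons) <= headroom_gb

# FP8 dev (~12 GB free): the heavy pipeline fits.
print(fits(12.0, "controlnet_fp8", "ipadapter", "face_detailer"))  # True
# FP16 dev (~2 GB free): FLUX-canny alone spills to CPU.
print(fits(2.0, "controlnet_fp16"))  # False
```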
Production gotchas
- T5-XXL CPU offload trap: passing `--lowvram` on a 4090 forces T5 offload and adds 3-4 s per image. Always use `--highvram`.
- Wrong VAE: FLUX needs the dedicated `ae.safetensors` from BFL. SDXL's VAE produces noise, not images, and it is the most common day-one mistake.
- Sampler choice: schnell only works with Euler at CFG 1.0 (it is a distilled model); higher CFG produces saturated colour artefacts. Dev wants Euler or DPM++ 2M at CFG 3.5-5.
- FP8 with old PyTorch: FP8 native tensor core support requires PyTorch 2.4 with CUDA 12.4 wheels. PyTorch 2.2 silently falls back to a software path that runs FP8 at FP16 throughput.
- Power throttling: sustained FLUX-dev generation pulls 440 W. Hot-spot temperature will climb past 85 °C in a poorly cooled chassis; review thermal performance and power draw.
- NSFW false positives: BFL’s safety filter is stricter than SDXL’s and refuses some legitimate medical or art-history prompts. Disable in research workflows; comply in production.
- Disk fills fast: at 47k schnell images/day and ~150 KB each, that is ~7 GB/day; FLUX-dev at 1920×1088 hits ~600 KB each. Move outputs to S3 nightly; see monthly hosting cost.
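Guarding against the silent FP8 fallback is easy to automate: refuse FP8 unless both the PyTorch version and the GPU's compute capability are new enough (Ada Lovelace reports 8.9). The helper below is a pure-Python sketch; in a real run you would pass `torch.__version__` and `torch.cuda.get_device_capability()`:

```python
def fp8_ok(torch_version: str, capability: tuple[int, int]) -> bool:
    """True if native FP8 tensor-core paths are plausible:
    PyTorch >= 2.4 and compute capability >= 8.9 (Ada Lovelace).
    Handles versions like "2.4.0" or "2.4.0+cu124"."""
    major, minor = (int(x.split("+")[0]) for x in torch_version.split(".")[:2])
    return (major, minor) >= (2, 4) and capability >= (8, 9)

print(fp8_ok("2.4.0", (8, 9)))  # True  -> 4090 with CUDA 12.4 wheels
print(fp8_ok("2.2.1", (8, 9)))  # False -> silent slow-path fallback
print(fp8_ok("2.4.0", (8, 6)))  # False -> Ampere, no native FP8
```

Dropping a check like this at the top of a batch script turns the "runs FP8 at FP16 throughput" failure mode into a hard error.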
Verdict
The 4090 is the cheapest hardware that runs FLUX.1-dev FP16 with no offload, and FP8 turns it into the highest-throughput single-GPU FLUX box on the market. For Apache-licensed schnell production at 42 images per minute, you cover an enormous range of commercial workloads on £550 of monthly hardware. For dev research at 8 images per minute FP8, the 4090 is again the right unit until you genuinely need an H100. For the broader image-business case see the image generation studio writeup.
Run FLUX.1-dev on a single card
24 GB of GDDR6X, no offload, 1.8 s schnell images. UK dedicated hosting.
Order the RTX 4090 24GB

See also: ComfyUI setup, Stable Diffusion setup, FLUX schnell benchmark, FLUX dev benchmark, image studio sizing, 4090 vs 5090, spec breakdown.