
FLUX.1 on RTX 4090 24GB: schnell, dev and FP8 Quantised Production Setup

Production setup for FLUX.1-schnell and FLUX.1-dev on a single RTX 4090 24GB, including FP8 quantisation, LoRA stacking and ControlNet.

FLUX.1 from Black Forest Labs is the new state of the art for open image generation, and the RTX 4090 24GB dedicated server is the cheapest hardware that can run the full FP16 dev variant without any offload. The Ada AD102’s native FP8 tensor cores cut VRAM in half with no perceptible quality loss, and the 1008 GB/s GDDR6X bandwidth keeps the 12-billion-parameter diffusion transformer fed at every sampling step. This guide covers schnell, dev and FP8-quantised workflows, plus LoRA stacking and ControlNet, so you can pick the right balance of speed, VRAM and quality on a GigaGPU dedicated box.


FLUX.1 variants and licensing

| Variant | Licence | Sampling steps | FP16 VRAM | FP8 VRAM | Use case |
| --- | --- | --- | --- | --- | --- |
| FLUX.1-schnell | Apache 2.0 | 4 | 22 GB | 11.5 GB | Commercial production, fast turnaround |
| FLUX.1-dev | Non-commercial research | 20-30 | 22 GB | 11.5 GB | Highest open-weight quality |
| FLUX.1-pro | API only (BFL) | 30+ | n/a | n/a | Closed, hosted only |
| FLUX.1-fill | Non-commercial | 20 | 22 GB | 11.5 GB | Inpaint/outpaint |
| FLUX.1-canny | Non-commercial | 20 | 22 GB | 11.5 GB | Edge-conditioned generation |
| FLUX.1-depth | Non-commercial | 20 | 22 GB | 11.5 GB | Depth-conditioned generation |

For commercial work, stick to schnell. The dev variant is for research, evaluation and personal projects only until BFL releases a commercial licence, which is rumoured but not promised.

Why 24 GB matters

FLUX is a 12-billion-parameter rectified-flow transformer paired with two text encoders (T5-XXL and CLIP-L). At FP16 the transformer alone lands at ~22 GB resident. That leaves around 2 GB of headroom on a 24 GB Ada card for batch 1, the VAE decode and a single small LoRA. Lower-VRAM cards must offload the T5-XXL encoder to CPU, which costs ~3-4 s per image on a fast desktop and 8+ s on slower hardware. The 4090 keeps everything resident and runs schnell in 1.8 s per image, dev in ~11 s.

| Component | FP16 | FP8 e4m3 | NF4 (GGUF) |
| --- | --- | --- | --- |
| Diffusion transformer | 22 GB | 11.5 GB | 6.8 GB |
| T5-XXL encoder | 9.5 GB | 4.8 GB | 4.8 GB |
| CLIP-L | 0.25 GB | 0.25 GB | 0.25 GB |
| VAE | 0.16 GB | 0.16 GB | 0.16 GB |
| 4090 leaves free | ~2 GB (T5 offloaded) | ~7 GB | ~12 GB |

FP8 e4m3 on Ada's native tensor cores is the production sweet spot: half the VRAM, visually identical output, and ~30% faster wall clock thanks to higher arithmetic throughput.
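To verify nothing is spilling into system RAM mid-run, watch resident VRAM from a second terminal; the expected numbers follow straight from the table above:

nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv -l 1
# FP16 dev should sit near 22-23 GB used; FP8 dev nearer 12-13 GB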

Install ComfyUI for FLUX

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI && python3.11 -m venv venv && source venv/bin/activate
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

# weights - the standalone FLUX transformer goes in models/unet, not models/checkpoints
cd models/unet
wget https://huggingface.co/black-forest-labs/FLUX.1-schnell/resolve/main/flux1-schnell.safetensors
cd ../clip
wget https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp16.safetensors
wget https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/clip_l.safetensors
cd ../vae
wget https://huggingface.co/black-forest-labs/FLUX.1-schnell/resolve/main/ae.safetensors

cd ../.. && python main.py --listen 0.0.0.0 --port 8188 --highvram

Load the official FLUX schnell example workflow from the ComfyUI examples page (comfyanonymous.github.io/ComfyUI_examples/flux/) and run with the default 4-step Euler sampler at CFG 1.0. For FP8, swap the FP16 T5 encoder for the FP8 variant in the DualCLIPLoader node and you free 4.7 GB.
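For production batches you will want to queue jobs headlessly over ComfyUI's HTTP API rather than the browser UI. A minimal sketch, assuming jq is installed and the workflow was exported in API format (enable dev mode in the settings, then "Save (API Format)") as flux_schnell_api.json:

# wrap the API-format workflow under a "prompt" key and POST it to the queue
jq -n --slurpfile wf flux_schnell_api.json '{prompt: $wf[0]}' \
  | curl -sS -X POST http://localhost:8188/prompt \
      -H 'Content-Type: application/json' -d @-
# finished images land in ComfyUI/output/ by default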

If you want the ComfyUI custom node walkthrough end-to-end (Manager, ControlNet preprocessors, Impact Pack), see the ComfyUI setup guide.
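On an always-on box, run the server under systemd rather than in a foreground shell. A minimal unit sketch, assuming ComfyUI lives at /opt/ComfyUI and runs as a comfy user (adjust paths and user to your install):

sudo tee /etc/systemd/system/comfyui.service <<'EOF'
[Unit]
Description=ComfyUI FLUX server
After=network-online.target

[Service]
User=comfy
WorkingDirectory=/opt/ComfyUI
ExecStart=/opt/ComfyUI/venv/bin/python main.py --listen 0.0.0.0 --port 8188 --highvram
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload && sudo systemctl enable --now comfyui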

FP8 quantised workflow

FP8 e4m3 weights halve VRAM with negligible quality loss on Ada Lovelace. Replace the FP16 transformer weights with flux1-dev-fp8.safetensors from Kijai/flux-fp8:

cd models/unet
wget https://huggingface.co/Kijai/flux-fp8/resolve/main/flux1-dev-fp8.safetensors
cd ../clip
wget https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp8_e4m3fn.safetensors

Free VRAM now sits around 12 GB, which is enough for two 1024×1024 images in parallel, three stacked LoRAs, or a heavy ControlNet pipeline (FLUX-canny + IPAdapter + face detailer). Wall clock drops from 11 s to 7.5 s per dev image at 20 steps, a 30% speed-up at no perceptible quality cost.
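If you would rather keep a single FP16 weight file on disk, recent ComfyUI builds can also cast the weights to FP8 at load time via launch flags (flag names vary by version; confirm with python main.py --help):

python main.py --listen 0.0.0.0 --port 8188 --highvram \
  --fp8_e4m3fn-unet --fp8_e4m3fn-text-enc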

Throughput and step counts

| Workflow | Steps | Resolution | Latency | Images/min | Images/day |
| --- | --- | --- | --- | --- | --- |
| schnell FP16 | 4 | 1024×1024 | 1.8 s | 33 | ~47,000 |
| schnell FP8 | 4 | 1024×1024 | 1.4 s | 42 | ~60,000 |
| dev FP16 | 20 | 1024×1024 | 11 s | 5.5 | ~7,900 |
| dev FP8 | 20 | 1024×1024 | 7.5 s | 8 | ~11,500 |
| dev FP16 | 28 | 1280×768 | 14 s | 4.3 | ~6,200 |
| dev FP8 | 28 | 1280×768 | 9.6 s | 6.3 | ~9,000 |
| dev FP8 | 28 | 1920×1088 | 21 s | 2.9 | ~4,100 |

At 33 schnell images per minute, a single 4090 pumps out ~47,000 images in a 24-hour cycle, which is enormous for a single-GPU box. See the FLUX schnell benchmark for the full step/resolution/sampler sweep, and the FLUX dev benchmark for the dev variant.
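The daily figures are straight extrapolation from per-image latency with no allowance for queueing, so re-deriving them for your own settings is one line:

python3 -c "lat=1.8; print(round(60/lat,1), 'img/min,', round(86400/lat), 'img/day')"
# 33.3 img/min, 48000 img/day; the table's ~47,000 allows a little real-world overhead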

Adding LoRAs and ControlNet

FLUX LoRAs are small (50-300 MB) and load at runtime through Comfy’s LoRA Loader node. With FP8 dev you can stack three LoRAs and still leave 4 GB free, which is enough for IPAdapter conditioning. With FP16 dev you can manage one LoRA at a time without spilling.

| Add-on | FP16 dev VRAM | FP8 dev VRAM | Latency penalty |
| --- | --- | --- | --- |
| +1 LoRA | +0.3 GB | +0.3 GB | +0.1 s |
| +IPAdapter | +1.4 GB | +1.4 GB | +0.4 s |
| +ControlNet (FLUX-canny) | +11 GB | +5.8 GB | +1.8 s |
| +Face detailer (Impact Pack) | +0.6 GB | +0.6 GB | +1.5 s |

For ControlNet specifically, FP8 is essentially mandatory on the 4090: FP16 dev plus FLUX-canny exceeds 24 GB and you start spilling to CPU.
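The LoRA files themselves just go in ComfyUI's models/loras directory; the URL below is a placeholder for whichever FLUX-compatible LoRA you are stacking:

cd models/loras
# placeholder URL - substitute the real repo and filename of your LoRA
wget https://huggingface.co/<user>/<repo>/resolve/main/my-flux-style-lora.safetensors

Then chain one LoRA Loader node per file between the model loader and the sampler.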

Production gotchas

  1. T5-XXL CPU offload trap: passing --lowvram on a 4090 forces T5 offload and adds 3-4 s per image. Always use --highvram.
  2. Wrong VAE: FLUX needs the dedicated ae.safetensors from BFL. SDXL’s VAE produces noise, not images, and it is the most common day-one mistake.
  3. Sampler choice: schnell only works with Euler at CFG 1.0 (it is a distilled model). Higher CFG produces saturated colour artefacts. Dev wants Euler or DPM++ 2M at CFG 3.5-5.
  4. FP8 with old PyTorch: FP8 native tensor core support requires PyTorch 2.4 with CUDA 12.4 wheels. PyTorch 2.2 silently falls back to a software path that runs FP8 at FP16 throughput (a quick version check follows this list).
  5. Power throttling: FLUX-dev sustained generation pulls 440 W. Hot-spot temperature will rise above 85 °C in a poorly-cooled chassis; review thermal performance and power draw.
  6. NSFW false positives: BFL’s safety filter is stricter than SDXL’s and refuses some legitimate medical or art-history prompts. Disable it in research workflows; comply in production.
  7. Disk fills faster than you think: at 47k schnell images/day and ~150 KB each, that is 7 GB/day; FLUX-dev at 1920×1088 hits ~600 KB per image. Move outputs to S3 nightly; see monthly hosting cost.
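The quick check for gotcha 4, run inside the venv (expect 2.4.0 and 12.4 with the install above):

python -c "import torch; print(torch.__version__, torch.version.cuda)"
# 2.4.0 12.4 - anything older than 2.4/cu124 means the slow software FP8 path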

Verdict

The 4090 is the cheapest hardware that runs FLUX.1-dev FP16 with no offload, and FP8 turns it into the highest-throughput single-GPU FLUX box on the market. For Apache-licensed schnell production at 42 images per minute, you cover an enormous range of commercial workloads on £550 of monthly hardware. For dev research at 8 FP8 images per minute, the 4090 is again the right unit until you genuinely need an H100. For the broader image-business case see the image generation studio writeup.

Run FLUX.1-dev on a single card

24 GB of GDDR6X, no offload, 1.8 s schnell images. UK dedicated hosting.

Order the RTX 4090 24GB

See also: ComfyUI setup, Stable Diffusion setup, FLUX schnell benchmark, FLUX dev benchmark, image studio sizing, 4090 vs 5090, spec breakdown.
