
FLUX.1 on RTX 4090 24GB: schnell, dev and FP8 Quantised Production Setup

Production setup for FLUX.1-schnell and FLUX.1-dev on a single RTX 4090 24GB, including FP8 quantisation, LoRA stacking and ControlNet.

FLUX.1 from Black Forest Labs is the new state of the art for open image generation, and the RTX 4090 24GB dedicated server is the cheapest hardware that can run the full FP16 dev variant without any offload. The Ada AD102’s native FP8 tensor cores cut VRAM in half with no perceptible quality loss, and the 1008 GB/s GDDR6X bandwidth keeps the 12-billion-parameter diffusion transformer fed at every sampling step. This guide covers schnell, dev and FP8-quantised workflows, plus LoRA stacking and ControlNet, so you can pick the right balance of speed, VRAM and quality on a GigaGPU dedicated box.


FLUX.1 variants and licensing

| Variant | Licence | Sampling steps | FP16 VRAM | FP8 VRAM | Use case |
| --- | --- | --- | --- | --- | --- |
| FLUX.1-schnell | Apache 2.0 | 4 | 22 GB | 11.5 GB | Commercial production, fast turnaround |
| FLUX.1-dev | Non-commercial research | 20-30 | 22 GB | 11.5 GB | Highest open-weight quality |
| FLUX.1-pro | API only (BFL) | 30+ | n/a | n/a | Closed, hosted only |
| FLUX.1-fill | Non-commercial | 20 | 22 GB | 11.5 GB | Inpaint/outpaint |
| FLUX.1-canny | Non-commercial | 20 | 22 GB | 11.5 GB | Edge-conditioned generation |
| FLUX.1-depth | Non-commercial | 20 | 22 GB | 11.5 GB | Depth-conditioned generation |

For commercial work, stick to schnell. The dev variant is for research, evaluation and personal projects only until BFL releases a commercial licence, which is rumoured but not promised.

Why 24 GB matters

FLUX is a 12-billion-parameter rectified-flow transformer paired with two text encoders (T5-XXL and CLIP-L). At FP16 the transformer alone lands at ~22 GB resident. That leaves around 2 GB of headroom on a 24 GB Ada card for batch 1, the VAE decode and a single small LoRA. Lower-VRAM cards must offload the T5-XXL encoder to CPU, which costs ~3-4 s per image on a fast desktop and 8+ s on slower hardware. The 4090 keeps everything resident and runs schnell in 1.8 s per image, dev in ~11 s.

| Component | FP16 | FP8 e4m3 | NF4 (GGUF) |
| --- | --- | --- | --- |
| Diffusion transformer | 22 GB | 11.5 GB | 6.8 GB |
| T5-XXL encoder | 9.5 GB | 4.8 GB | 4.8 GB |
| CLIP-L | 0.25 GB | 0.25 GB | 0.25 GB |
| VAE | 0.16 GB | 0.16 GB | 0.16 GB |
| 4090 leaves free | ~2 GB (T5 offloaded) | ~7 GB | ~12 GB |

FP8 e4m3 on Ada's native tensor cores is the production sweet spot: half the VRAM, visually identical output, and ~30% faster wall clock thanks to higher arithmetic throughput.
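To verify nothing is spilling into system RAM mid-run, watch resident VRAM from a second terminal; the expected numbers follow straight from the table above:

nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv -l 1
# FP16 dev should sit near 22-23 GB used; FP8 dev nearer 12-13 GB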

Install ComfyUI for FLUX

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI && python3.11 -m venv venv && source venv/bin/activate
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

# weights - the standalone FLUX transformer goes in models/unet, not models/checkpoints
cd models/unet
wget https://huggingface.co/black-forest-labs/FLUX.1-schnell/resolve/main/flux1-schnell.safetensors
cd ../clip
wget https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp16.safetensors
wget https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/clip_l.safetensors
cd ../vae
wget https://huggingface.co/black-forest-labs/FLUX.1-schnell/resolve/main/ae.safetensors

cd ../.. && python main.py --listen 0.0.0.0 --port 8188 --highvram

Load the official FLUX schnell example workflow from the ComfyUI examples page (comfyanonymous.github.io/ComfyUI_examples/flux/) and run with the default 4-step Euler sampler at CFG 1.0. For FP8, swap the FP16 T5 encoder for the FP8 variant in the DualCLIPLoader node and you free 4.7 GB.
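For production batches you will want to queue jobs headlessly over ComfyUI's HTTP API rather than the browser UI. A minimal sketch, assuming jq is installed and the workflow was exported in API format (enable dev mode in the settings, then "Save (API Format)") as flux_schnell_api.json:

# wrap the API-format workflow under a "prompt" key and POST it to the queue
jq -n --slurpfile wf flux_schnell_api.json '{prompt: $wf[0]}' \
  | curl -sS -X POST http://localhost:8188/prompt \
      -H 'Content-Type: application/json' -d @-
# finished images land in ComfyUI/output/ by default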

If you want the ComfyUI custom node walkthrough end-to-end (Manager, ControlNet preprocessors, Impact Pack), see the ComfyUI setup guide.
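On an always-on box, run the server under systemd rather than in a foreground shell. A minimal unit sketch, assuming ComfyUI lives at /opt/ComfyUI and runs as a comfy user (adjust paths and user to your install):

sudo tee /etc/systemd/system/comfyui.service <<'EOF'
[Unit]
Description=ComfyUI FLUX server
After=network-online.target

[Service]
User=comfy
WorkingDirectory=/opt/ComfyUI
ExecStart=/opt/ComfyUI/venv/bin/python main.py --listen 0.0.0.0 --port 8188 --highvram
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload && sudo systemctl enable --now comfyui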

FP8 quantised workflow

FP8 e4m3 weights halve VRAM with negligible quality loss on Ada Lovelace. Replace the FP16 transformer weights with flux1-dev-fp8.safetensors from Kijai/flux-fp8:

cd models/unet
wget https://huggingface.co/Kijai/flux-fp8/resolve/main/flux1-dev-fp8.safetensors
cd ../clip
wget https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp8_e4m3fn.safetensors

Free VRAM now sits around 12 GB, which is enough for two 1024×1024 images in parallel, three stacked LoRAs, or a heavy ControlNet pipeline (FLUX-canny + IPAdapter + face detailer). Wall clock drops from 11 s to 7.5 s per dev image at 20 steps, a 30% speed-up at no perceptible quality cost.
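If you would rather keep a single FP16 weight file on disk, recent ComfyUI builds can also cast the weights to FP8 at load time via launch flags (flag names vary by version; confirm with python main.py --help):

python main.py --listen 0.0.0.0 --port 8188 --highvram \
  --fp8_e4m3fn-unet --fp8_e4m3fn-text-enc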

Throughput and step counts

| Workflow | Steps | Resolution | Latency | Images/min | Images/day |
| --- | --- | --- | --- | --- | --- |
| schnell FP16 | 4 | 1024×1024 | 1.8 s | 33 | ~47,000 |
| schnell FP8 | 4 | 1024×1024 | 1.4 s | 42 | ~60,000 |
| dev FP16 | 20 | 1024×1024 | 11 s | 5.5 | ~7,900 |
| dev FP8 | 20 | 1024×1024 | 7.5 s | 8 | ~11,500 |
| dev FP16 | 28 | 1280×768 | 14 s | 4.3 | ~6,200 |
| dev FP8 | 28 | 1280×768 | 9.6 s | 6.3 | ~9,000 |
| dev FP8 | 28 | 1920×1088 | 21 s | 2.9 | ~4,100 |

At 33 schnell images per minute, a single 4090 pumps out ~47,000 images in a 24-hour cycle, which is enormous for a single-GPU box. See the FLUX schnell benchmark for the full step/resolution/sampler sweep, and the FLUX dev benchmark for the dev variant.
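The daily figures are straight extrapolation from per-image latency with no allowance for queueing, so re-deriving them for your own settings is one line:

python3 -c "lat=1.8; print(round(60/lat,1), 'img/min,', round(86400/lat), 'img/day')"
# 33.3 img/min, 48000 img/day; the table's ~47,000 allows a little real-world overhead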

Adding LoRAs and ControlNet

FLUX LoRAs are small (50-300 MB) and load at runtime through Comfy’s LoRA Loader node. With FP8 dev you can stack three LoRAs and still leave 4 GB free, which is enough for IPAdapter conditioning. With FP16 dev you can manage one LoRA at a time without spilling.

| Add-on | FP16 dev VRAM | FP8 dev VRAM | Latency penalty |
| --- | --- | --- | --- |
| +1 LoRA | +0.3 GB | +0.3 GB | +0.1 s |
| +IPAdapter | +1.4 GB | +1.4 GB | +0.4 s |
| +ControlNet (FLUX-canny) | +11 GB | +5.8 GB | +1.8 s |
| +Face detailer (Impact Pack) | +0.6 GB | +0.6 GB | +1.5 s |

For ControlNet specifically, FP8 is essentially mandatory on the 4090: FP16 dev plus FLUX-canny exceeds 24 GB and you start spilling to CPU.
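The LoRA files themselves just go in ComfyUI's models/loras directory; the URL below is a placeholder for whichever FLUX-compatible LoRA you are stacking:

cd models/loras
# placeholder URL - substitute the real repo and filename of your LoRA
wget https://huggingface.co/<user>/<repo>/resolve/main/my-flux-style-lora.safetensors

Then chain one LoRA Loader node per file between the model loader and the sampler.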

Production gotchas

  1. T5-XXL CPU offload trap: passing --lowvram on a 4090 forces T5 offload and adds 3-4 s per image. Always use --highvram.
  2. Wrong VAE: FLUX needs the dedicated ae.safetensors from BFL. SDXL’s VAE produces noise, not images, and it is the most common day-one mistake.
  3. Sampler choice: schnell only works with Euler at CFG 1.0 (it is a distilled model). Higher CFG produces saturated colour artefacts. Dev wants Euler or DPM++ 2M at CFG 3.5-5.
  4. FP8 with old PyTorch: FP8 native tensor core support requires PyTorch 2.4 with CUDA 12.4 wheels. PyTorch 2.2 silently falls back to a software path that runs FP8 at FP16 throughput (a quick version check follows this list).
  5. Power throttling: FLUX-dev sustained generation pulls 440 W. Hot-spot temperature will rise above 85 °C in a poorly-cooled chassis; review thermal performance and power draw.
  6. NSFW false positives: BFL’s safety filter is stricter than SDXL’s and refuses some legitimate medical or art-history prompts. Disable it in research workflows; comply in production.
  7. Disk fills faster than you think: at 47k schnell images/day and ~150 KB each, that is 7 GB/day; FLUX-dev at 1920×1088 hits ~600 KB per image. Move outputs to S3 nightly; see monthly hosting cost.
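The quick check for gotcha 4, run inside the venv (expect 2.4.0 and 12.4 with the install above):

python -c "import torch; print(torch.__version__, torch.version.cuda)"
# 2.4.0 12.4 - anything older than 2.4/cu124 means the slow software FP8 path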

Verdict

The 4090 is the cheapest hardware that runs FLUX.1-dev FP16 with no offload, and FP8 turns it into the highest-throughput single-GPU FLUX box on the market. For Apache-licensed schnell production at 42 images per minute, you cover an enormous range of commercial workloads on £550 of monthly hardware. For dev research at 8 FP8 images per minute, the 4090 is again the right unit until you genuinely need an H100. For the broader image-business case see the image generation studio writeup.

Run FLUX.1-dev on a single card

24 GB of GDDR6X, no offload, 1.8 s schnell images. UK dedicated hosting.

Order the RTX 4090 24GB

See also: ComfyUI setup, Stable Diffusion setup, FLUX schnell benchmark, FLUX dev benchmark, image studio sizing, 4090 vs 5090, spec breakdown.
