
Stable Diffusion on RTX 4090 24GB: Diffusers, A1111 and ComfyUI Production Setup

Production setup for SD 1.5, SDXL and FLUX.1 on the RTX 4090 24GB across Diffusers, AUTOMATIC1111 and ComfyUI, with verified throughput and gotchas.

The RTX 4090 24GB is comfortably the fastest single-GPU image generation card you can rent at consumer prices. With 24 GB of GDDR6X you keep SDXL fully resident, run FLUX.1-dev in FP16 with no offload, and batch SD 1.5 four-up for production work. The 16,384 CUDA cores and 1008 GB/s bandwidth give you H100-class image throughput per pound. This guide walks through three popular stacks on a GigaGPU 4090 dedicated server, with the same recipes we ship to dedicated hosting customers and the gotchas you only learn the hard way.

What 24 GB unlocks

SDXL needs ~10 GB at FP16, FLUX.1-dev needs ~22 GB at FP16. The 4090’s 1008 GB/s bandwidth and 16,384 CUDA cores make Ada the sweet spot for diffusion. You get H100-class image throughput without renting an H100, and the native FP8 tensor cores halve VRAM on FLUX with negligible quality loss.

| Model | Precision | VRAM | 1024×1024 latency | Steps |
|---|---|---|---|---|
| SD 1.5 | FP16 | 2.4 GB | 0.8 s | 25 |
| SDXL Base | FP16 | 9.8 GB | 2.0 s | 28 |
| SDXL + Refiner | FP16 | 14.2 GB | 3.1 s | 28+8 |
| FLUX.1-schnell | FP8 | 11.5 GB | 1.8 s | 4 |
| FLUX.1-schnell | FP16 | 22.0 GB | 1.8 s | 4 |
| FLUX.1-dev | FP16 | 22.0 GB | 11 s | 28 |
| FLUX.1-dev | FP8 | 11.5 GB | 7.5 s | 20 |
| SD3 Medium | FP16 | 5.6 GB | 2.4 s | 28 |
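
Before choosing a row, it is worth checking the actual headroom on the box. A minimal sketch using PyTorch's built-in memory query; the 22 GB threshold is the FLUX.1-dev FP16 figure from the table above:

import torch

# Free and total VRAM in bytes on the current CUDA device
free, total = torch.cuda.mem_get_info()
free_gb = free / 1024**3

# FLUX.1-dev FP16 needs ~22 GB resident (table above)
if free_gb < 22:
    print(f"only {free_gb:.1f} GB free: use the FP8 weights")
else:
    print(f"{free_gb:.1f} GB free: FP16 FLUX.1-dev fits")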

Base system setup

If you have not already, run the first-day checklist to verify the driver, NVMe and firewall. The image stacks need:

| Layer | Version | Why |
|---|---|---|
| OS | Ubuntu 22.04 LTS | Best wheel coverage |
| NVIDIA driver | 550+ | FP8 tensor core support |
| CUDA toolkit | 12.4 | PyTorch 2.4 wheels |
| Python | 3.10 or 3.11 | 3.12 still has wheel gaps |
| PyTorch | 2.4.0+cu124 | SDPA flash on Ada |
| NVMe free | 500+ GB | Models, LoRAs, outputs |
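
Before pulling any models, confirm the stack matches the table. A minimal sketch; checking compute capability 8.9 is our assumption of how to verify an Ada-class part with FP8 tensor cores:

import torch

print(torch.__version__)              # expect 2.4.0+cu124
print(torch.version.cuda)             # expect 12.4
print(torch.cuda.get_device_name(0))  # expect NVIDIA GeForce RTX 4090

# Ada Lovelace reports compute capability 8.9; the FP8 rows in the
# model table above rely on it
major, minor = torch.cuda.get_device_capability(0)
assert (major, minor) >= (8, 9), "FP8 tensor cores need Ada or newer"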

Diffusers (the API path)

The Hugging Face Diffusers library is the cleanest option for embedding SDXL or FLUX in a backend. It gives you plain Python objects, no UI, and direct access to the pipeline's latents.

python3.11 -m venv ~/diff-env && source ~/diff-env/bin/activate
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install diffusers==0.30.3 transformers==4.45.2 accelerate xformers safetensors
pip install --upgrade huggingface_hub
huggingface-cli login

Generate an SDXL image inside a service in a few lines:

import torch
from diffusers import StableDiffusionXLPipeline

# Load SDXL base in FP16; ~10 GB resident on the 4090
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipe.enable_xformers_memory_efficient_attention()  # recovers ~1.2 GB VRAM
pipe.unet.to(memory_format=torch.channels_last)    # 12-15% faster on Ada

img = pipe("a photo of a Welsh hillside at dawn",
           num_inference_steps=28, guidance_scale=7.0).images[0]
img.save("out.png")

memory_format=torch.channels_last gives a 12-15% speed boost on Ada at zero quality cost; enable_xformers_memory_efficient_attention() recovers ~1.2 GB VRAM. Together they let you run batch 6 SDXL on a single 4090.
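
To use that headroom, batch through the same pipe object; num_images_per_prompt is the stock Diffusers batching parameter. A minimal sketch:

# Batch 6 at 1024×1024 fits in 24 GB with both optimisations enabled
imgs = pipe(
    "a photo of a Welsh hillside at dawn",
    num_inference_steps=28, guidance_scale=7.0,
    num_images_per_prompt=6,
).images

for i, img in enumerate(imgs):
    img.save(f"out_{i}.png")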

AUTOMATIC1111 setup

A1111 is still the most familiar UI for individual creators, largely thanks to its model and extension ecosystem. Install with Python 3.10 specifically (3.11+ breaks several extensions):

sudo apt install -y python3.10 python3.10-venv git
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
cd stable-diffusion-webui
python3.10 -m venv venv && source venv/bin/activate
./webui.sh --listen --xformers --port 7860 --api

Tune for Ada by setting the following in webui-user.sh:

export COMMANDLINE_ARGS="--listen --port 7860 --opt-sdp-attention --no-half-vae --api --enable-insecure-extension-access"
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

--opt-sdp-attention is faster than xformers on Ada with PyTorch 2.4+. --no-half-vae stops black-image artefacts on the SDXL VAE. The 4090 saturates at batch size 4 for 1024×1024 SDXL, producing 4 images in 5.6 s (see the throughput table below).
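
Because --api is enabled, the same instance doubles as an HTTP backend. A minimal sketch against A1111's /sdapi/v1/txt2img route (host, prompt and output paths are illustrative):

import base64
import requests

# A1111 returns generated images as base64-encoded strings
resp = requests.post("http://localhost:7860/sdapi/v1/txt2img", json={
    "prompt": "a photo of a Welsh hillside at dawn",
    "steps": 28,
    "width": 1024,
    "height": 1024,
    "batch_size": 4,  # the 4090's SDXL sweet spot per the text above
})
resp.raise_for_status()

for i, b64 in enumerate(resp.json()["images"]):
    with open(f"out_{i}.png", "wb") as f:
        f.write(base64.b64decode(b64))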

ComfyUI setup

ComfyUI gives you the most control and the lowest VRAM overhead, important for FLUX.1-dev workflows. See the dedicated ComfyUI setup for the full custom-node walkthrough.

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI && python3.11 -m venv venv && source venv/bin/activate
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
python main.py --listen 0.0.0.0 --port 8188 --highvram --preview-method auto

Drop SDXL or FLUX safetensors into models/checkpoints/ and FLUX text encoders into models/clip/. --highvram tells Comfy to keep the entire UNet resident, halving step latency on the 4090. Add --preview-method auto for live latent previews, which costs ~3% throughput but is invaluable for art-direction work.
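
Comfy also queues work over plain HTTP, which is what makes it the production pick later in this guide. A minimal sketch that posts a workflow exported with the UI's Save (API Format) option to the /prompt endpoint (the workflow filename is hypothetical):

import json
import requests

# Export the graph from the UI with "Save (API Format)" first
with open("sdxl_workflow_api.json") as f:  # hypothetical filename
    workflow = json.load(f)

# ComfyUI queues the job and returns a prompt_id you can poll
resp = requests.post("http://localhost:8188/prompt", json={"prompt": workflow})
resp.raise_for_status()
print(resp.json()["prompt_id"])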

Throughput numbers

| Stack | Model | Batch | Latency | Images/min |
|---|---|---|---|---|
| Diffusers | SDXL FP16 | 1 | 2.0 s | 30 |
| Diffusers | SDXL FP16 | 4 | 5.4 s | 44 |
| A1111 | SDXL FP16 | 1 | 2.2 s | 27 |
| A1111 | SDXL FP16 | 4 | 5.6 s | 43 |
| ComfyUI | SDXL FP16 | 1 | 1.9 s | 32 |
| ComfyUI | SDXL FP16 | 4 | 5.2 s | 46 |
| ComfyUI | FLUX.1-schnell FP8 | 1 | 1.4 s | 42 |
| ComfyUI | FLUX.1-schnell FP16 | 1 | 1.8 s | 33 |
| ComfyUI | FLUX.1-dev FP8 | 1 | 7.5 s | 8 |
| ComfyUI | FLUX.1-dev FP16 | 1 | 11 s | 5.5 |

At ~46 SDXL images/min the 4090 produces ~66,000 images per day at full utilisation, comfortably more than enough for an SMB asset pipeline. See the SDXL benchmark and FLUX schnell benchmark for the full sweep.

FLUX.1 quick start

FLUX.1-schnell is the easiest first step: Apache-licensed, four-step sampling, 1.8 s per image at FP16. Drop the FP8 weights into Comfy, load the default schnell workflow, and you are done. For higher fidelity move to FLUX.1-dev (non-commercial) which fits in 22 GB at FP16, or FLUX.1-dev FP8 which leaves ~12 GB free for ControlNet and IPAdapter pipelines. The detailed walkthrough is in the FLUX setup guide; for image studio sizing see the image generation studio writeup.
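
If you would rather drive schnell from Diffusers than Comfy, the stock FluxPipeline runs it on the 4090 with no offload. A minimal sketch; bfloat16, four steps and guidance_scale=0.0 follow the schnell model card:

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
).to("cuda")

# schnell is timestep-distilled: four steps, guidance disabled
img = pipe(
    "a photo of a Welsh hillside at dawn",
    num_inference_steps=4, guidance_scale=0.0,
    max_sequence_length=256,  # schnell's T5 sequence cap
).images[0]
img.save("flux_out.png")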

Production gotchas

  1. SDXL VAE black images: the SDXL FP16 VAE has known instability. Always pass --no-half-vae in A1111 or use the madebyollin/sdxl-vae-fp16-fix VAE in Diffusers (see the sketch after this list).
  2. Memory leak on long-running A1111: A1111 leaks ~50 MB per generation in some extension combinations. Restart the worker every 1,000 generations or after extension updates.
  3. xformers vs SDPA confusion: PyTorch 2.4’s SDPA is faster than xformers on Ada for SDXL but slower for SD 1.5. Benchmark both for your workload.
  4. Concurrent users: A1111 single-threads its UI; for production traffic you need ComfyUI behind a queue or Diffusers in a worker pool. See concurrent users.
  5. Disk fills with outputs: at 46 SDXL images/min, 1024×1024 PNGs of roughly 3-4 MB each add up to ~10 GB/hour. Move outputs to S3 or a separate volume nightly; see monthly hosting cost for storage notes.
  6. FLUX text encoder offload: lower-VRAM cards offload T5-XXL to CPU and pay 3-4 s per image. The 4090 keeps T5 resident at FP16; do not pass --lowvram or you lose that speed.
  7. Power draw under sustained image gen: SDXL batch 4 pulls ~440 W sustained. Verify thermal performance ahead of multi-day runs.
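
For the Diffusers side of gotcha 1, the fixed VAE is a drop-in replacement. A minimal sketch, reusing the SDXL pipeline from earlier:

import torch
from diffusers import AutoencoderKL, StableDiffusionXLPipeline

# The fp16-fix VAE avoids the NaN/black-image failure at half precision
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae, torch_dtype=torch.float16, variant="fp16",
).to("cuda")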

Verdict

For production diffusion on a single GPU, the 4090 is the right unit. Diffusers wins for backend embedding, ComfyUI for production workflows and FLUX, and A1111 for individual creators and the extension ecosystem. Pick one, pin your stack, run the throughput numbers above as your acceptance test on day one, and you have a £550/month image generation backend that prints money compared to per-image API pricing. For comparison with cheaper hardware see the 5060 Ti SD setup; for the next tier up see the 4090 vs 5090 comparison.


See also: FLUX.1 setup, ComfyUI setup, SDXL benchmark, FLUX dev benchmark, image generation studio, 4090 vs 5090, TFLOPS class.
