
Stable Diffusion on RTX 4090 24GB: Diffusers, A1111 and ComfyUI Production Setup

Production setup for SD 1.5, SDXL and FLUX.1 on the RTX 4090 24GB across Diffusers, AUTOMATIC1111 and ComfyUI, with verified throughput and gotchas.

The RTX 4090 24GB is comfortably the fastest single-GPU image generation card you can rent at consumer prices. With 24 GB of GDDR6X you keep SDXL fully resident, run FLUX.1-dev in FP16 with no offload, and batch SD 1.5 four-up for production work. The 16,384 CUDA cores and 1008 GB/s bandwidth give you H100-class image throughput per pound. This guide walks through three popular stacks on a GigaGPU 4090 dedicated server, with the same recipes we ship to dedicated hosting customers and the gotchas you only learn the hard way.

What 24 GB unlocks

SDXL needs ~10 GB at FP16, FLUX.1-dev needs ~22 GB at FP16. The 4090’s 1008 GB/s bandwidth and 16,384 CUDA cores make Ada the sweet spot for diffusion. You get H100-class image throughput without renting an H100, and the native FP8 tensor cores halve VRAM on FLUX with negligible quality loss.

| Model | Precision | VRAM | 1024×1024 latency | Steps |
|---|---|---|---|---|
| SD 1.5 | FP16 | 2.4 GB | 0.8 s | 25 |
| SDXL Base | FP16 | 9.8 GB | 2.0 s | 28 |
| SDXL + Refiner | FP16 | 14.2 GB | 3.1 s | 28+8 |
| FLUX.1-schnell | FP8 | 11.5 GB | 1.8 s | 4 |
| FLUX.1-schnell | FP16 | 22.0 GB | 1.8 s | 4 |
| FLUX.1-dev | FP16 | 22.0 GB | 11 s | 28 |
| FLUX.1-dev | FP8 | 11.5 GB | 7.5 s | 20 |
| SD3 Medium | FP16 | 5.6 GB | 2.4 s | 28 |
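
Before choosing a row, it is worth checking the actual headroom on the box. A minimal sketch using PyTorch's built-in memory query; the 22 GB threshold is the FLUX.1-dev FP16 figure from the table above:

import torch

# Free and total VRAM in bytes on the current CUDA device
free, total = torch.cuda.mem_get_info()
free_gb = free / 1024**3

# FLUX.1-dev FP16 needs ~22 GB resident (table above)
if free_gb < 22:
    print(f"only {free_gb:.1f} GB free: use the FP8 weights")
else:
    print(f"{free_gb:.1f} GB free: FP16 FLUX.1-dev fits")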

Base system setup

If you have not already, run the first-day checklist to verify the driver, NVMe and firewall. The image stacks need:

| Layer | Version | Why |
|---|---|---|
| OS | Ubuntu 22.04 LTS | Best wheel coverage |
| NVIDIA driver | 550+ | FP8 tensor core support |
| CUDA toolkit | 12.4 | PyTorch 2.4 wheels |
| Python | 3.10 or 3.11 | 3.12 still has wheel gaps |
| PyTorch | 2.4.0+cu124 | SDPA flash on Ada |
| NVMe free | 500+ GB | Models, LoRAs, outputs |
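
Before pulling any models, confirm the stack matches the table. A minimal sketch; checking compute capability 8.9 is our assumption of how to verify an Ada-class part with FP8 tensor cores:

import torch

print(torch.__version__)              # expect 2.4.0+cu124
print(torch.version.cuda)             # expect 12.4
print(torch.cuda.get_device_name(0))  # expect NVIDIA GeForce RTX 4090

# Ada Lovelace reports compute capability 8.9; the FP8 rows in the
# model table above rely on it
major, minor = torch.cuda.get_device_capability(0)
assert (major, minor) >= (8, 9), "FP8 tensor cores need Ada or newer"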

Diffusers (the API path)

The Hugging Face Diffusers library is the cleanest option for embedding SDXL or FLUX in a backend. It gives you plain Python objects, no UI, and direct access to the pipeline's latents.

python3.11 -m venv ~/diff-env && source ~/diff-env/bin/activate
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install diffusers==0.30.3 transformers==4.45.2 accelerate xformers safetensors
pip install --upgrade huggingface_hub
huggingface-cli login

Generate an SDXL image inside a service in a few lines:

import torch
from diffusers import StableDiffusionXLPipeline

# Load SDXL base in FP16; ~10 GB resident on the 4090
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipe.enable_xformers_memory_efficient_attention()  # recovers ~1.2 GB VRAM
pipe.unet.to(memory_format=torch.channels_last)    # 12-15% faster on Ada

img = pipe("a photo of a Welsh hillside at dawn",
           num_inference_steps=28, guidance_scale=7.0).images[0]
img.save("out.png")

memory_format=torch.channels_last gives a 12-15% speed boost on Ada at zero quality cost; enable_xformers_memory_efficient_attention() recovers ~1.2 GB VRAM. Together they let you run batch 6 SDXL on a single 4090.
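
To use that headroom, batch through the same pipe object; num_images_per_prompt is the stock Diffusers batching parameter. A minimal sketch:

# Batch 6 at 1024×1024 fits in 24 GB with both optimisations enabled
imgs = pipe(
    "a photo of a Welsh hillside at dawn",
    num_inference_steps=28, guidance_scale=7.0,
    num_images_per_prompt=6,
).images

for i, img in enumerate(imgs):
    img.save(f"out_{i}.png")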

AUTOMATIC1111 setup

A1111 is still the most familiar UI for individual creators, largely thanks to its model and extension ecosystem. Install with Python 3.10 specifically (3.11+ breaks several extensions):

sudo apt install -y python3.10 python3.10-venv git
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
cd stable-diffusion-webui
python3.10 -m venv venv && source venv/bin/activate
./webui.sh --listen --xformers --port 7860 --api

Tune for Ada by setting the following in webui-user.sh:

export COMMANDLINE_ARGS="--listen --port 7860 --opt-sdp-attention --no-half-vae --api --enable-insecure-extension-access"
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

--opt-sdp-attention is faster than xformers on Ada with PyTorch 2.4+. --no-half-vae stops black-image artefacts on the SDXL VAE. The 4090 saturates at batch size 4 for 1024×1024 SDXL, producing 4 images in 5.6 s (see the throughput table below).
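
Because --api is enabled, the same instance doubles as an HTTP backend. A minimal sketch against A1111's /sdapi/v1/txt2img route (host, prompt and output paths are illustrative):

import base64
import requests

# A1111 returns generated images as base64-encoded strings
resp = requests.post("http://localhost:7860/sdapi/v1/txt2img", json={
    "prompt": "a photo of a Welsh hillside at dawn",
    "steps": 28,
    "width": 1024,
    "height": 1024,
    "batch_size": 4,  # the 4090's SDXL sweet spot per the text above
})
resp.raise_for_status()

for i, b64 in enumerate(resp.json()["images"]):
    with open(f"out_{i}.png", "wb") as f:
        f.write(base64.b64decode(b64))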

ComfyUI setup

ComfyUI gives you the most control and the lowest VRAM overhead, important for FLUX.1-dev workflows. See the dedicated ComfyUI setup for the full custom-node walkthrough.

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI && python3.11 -m venv venv && source venv/bin/activate
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
python main.py --listen 0.0.0.0 --port 8188 --highvram --preview-method auto

Drop SDXL or FLUX safetensors into models/checkpoints/ and FLUX text encoders into models/clip/. --highvram tells Comfy to keep the entire UNet resident, halving step latency on the 4090. Add --preview-method auto for live latent previews, which costs ~3% throughput but is invaluable for art-direction work.
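
Comfy also queues work over plain HTTP, which is what makes it the production pick later in this guide. A minimal sketch that posts a workflow exported with the UI's Save (API Format) option to the /prompt endpoint (the workflow filename is hypothetical):

import json
import requests

# Export the graph from the UI with "Save (API Format)" first
with open("sdxl_workflow_api.json") as f:  # hypothetical filename
    workflow = json.load(f)

# ComfyUI queues the job and returns a prompt_id you can poll
resp = requests.post("http://localhost:8188/prompt", json={"prompt": workflow})
resp.raise_for_status()
print(resp.json()["prompt_id"])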

Throughput numbers

| Stack | Model | Batch | Latency | Images/min |
|---|---|---|---|---|
| Diffusers | SDXL FP16 | 1 | 2.0 s | 30 |
| Diffusers | SDXL FP16 | 4 | 5.4 s | 44 |
| A1111 | SDXL FP16 | 1 | 2.2 s | 27 |
| A1111 | SDXL FP16 | 4 | 5.6 s | 43 |
| ComfyUI | SDXL FP16 | 1 | 1.9 s | 32 |
| ComfyUI | SDXL FP16 | 4 | 5.2 s | 46 |
| ComfyUI | FLUX.1-schnell FP8 | 1 | 1.4 s | 42 |
| ComfyUI | FLUX.1-schnell FP16 | 1 | 1.8 s | 33 |
| ComfyUI | FLUX.1-dev FP8 | 1 | 7.5 s | 8 |
| ComfyUI | FLUX.1-dev FP16 | 1 | 11 s | 5.5 |

At ~46 SDXL images/min the 4090 produces ~66,000 images per day at full utilisation, comfortably more than enough for an SMB asset pipeline. See the SDXL benchmark and FLUX schnell benchmark for the full sweep.

FLUX.1 quick start

FLUX.1-schnell is the easiest first step: Apache-licensed, four-step sampling, 1.8 s per image at FP16. Drop the FP8 weights into Comfy, load the default schnell workflow, and you are done. For higher fidelity move to FLUX.1-dev (non-commercial) which fits in 22 GB at FP16, or FLUX.1-dev FP8 which leaves ~12 GB free for ControlNet and IPAdapter pipelines. The detailed walkthrough is in the FLUX setup guide; for image studio sizing see the image generation studio writeup.
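
If you would rather drive schnell from Diffusers than Comfy, the stock FluxPipeline runs it on the 4090 with no offload. A minimal sketch; bfloat16, four steps and guidance_scale=0.0 follow the schnell model card:

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
).to("cuda")

# schnell is timestep-distilled: four steps, guidance disabled
img = pipe(
    "a photo of a Welsh hillside at dawn",
    num_inference_steps=4, guidance_scale=0.0,
    max_sequence_length=256,  # schnell's T5 sequence cap
).images[0]
img.save("flux_out.png")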

Production gotchas

  1. SDXL VAE black images: the SDXL FP16 VAE has known instability. Always pass --no-half-vae in A1111 or use the madebyollin/sdxl-vae-fp16-fix VAE in Diffusers (see the sketch after this list).
  2. Memory leak on long-running A1111: A1111 leaks ~50 MB per generation in some extension combinations. Restart the worker every 1,000 generations or after extension updates.
  3. xformers vs SDPA confusion: PyTorch 2.4’s SDPA is faster than xformers on Ada for SDXL but slower for SD 1.5. Benchmark both for your workload.
  4. Concurrent users: A1111 single-threads its UI; for production traffic you need ComfyUI behind a queue or Diffusers in a worker pool. See concurrent users.
  5. Disk fills with outputs: at 46 SDXL images/min, 1024×1024 PNGs of roughly 3-4 MB each add up to ~10 GB/hour. Move outputs to S3 or a separate volume nightly; see monthly hosting cost for storage notes.
  6. FLUX text encoder offload: lower-VRAM cards offload T5-XXL to CPU and pay 3-4 s per image. The 4090 keeps T5 resident at FP16; do not pass --lowvram or you lose that speed.
  7. Power draw under sustained image gen: SDXL batch 4 pulls ~440 W sustained. Verify thermal performance ahead of multi-day runs.
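
For the Diffusers side of gotcha 1, the fixed VAE is a drop-in replacement. A minimal sketch, reusing the SDXL pipeline from earlier:

import torch
from diffusers import AutoencoderKL, StableDiffusionXLPipeline

# The fp16-fix VAE avoids the NaN/black-image failure at half precision
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae, torch_dtype=torch.float16, variant="fp16",
).to("cuda")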

Verdict

For production diffusion on a single GPU, the 4090 is the right unit. Diffusers wins for backend embedding, ComfyUI for production workflows and FLUX, and A1111 for individual creators and the extension ecosystem. Pick one, pin your stack, run the throughput numbers above as your acceptance test on day one, and you have a £550/month image generation backend that prints money compared to per-image API pricing. For comparison with cheaper hardware see the 5060 Ti SD setup; for the next tier up see the 4090 vs 5090 comparison.


See also: FLUX.1 setup, ComfyUI setup, SDXL benchmark, FLUX dev benchmark, image generation studio, 4090 vs 5090, TFLOPS class.
