
ComfyUI on RTX 4090 24GB: Production Install, Custom Nodes and Workflows

Full production ComfyUI install for the RTX 4090 24GB with the custom nodes that make FLUX.1 and SDXL workflows production-ready, plus systemd, monitoring and gotchas.

ComfyUI is the most efficient diffusion runtime for the RTX 4090 24GB. It exposes the entire pipeline as a node graph, lets you swap precisions on the fly, and carries lower VRAM overhead than AUTOMATIC1111 or Forge. On Ada AD102 with 24 GB of GDDR6X you can hold FLUX.1-dev FP16 fully resident, batch SDXL four-up, and queue ControlNet workflows that would spill on a 16 GB card. This guide installs ComfyUI on a fresh GigaGPU RTX 4090 server and configures the custom nodes you actually need for SDXL and FLUX.1 production work on dedicated hosting.

Base install

sudo apt update && sudo apt install -y python3.11 python3.11-venv git
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
python3.11 -m venv venv && source venv/bin/activate
pip install --upgrade pip
pip install torch==2.4.0 torchvision --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
python main.py --listen 0.0.0.0 --port 8188 --highvram

The --highvram flag is mandatory on the 4090. It instructs Comfy to keep the UNet, VAE and text encoders resident, which removes the per-step CPU-to-GPU transfer cost. Without it, FLUX.1-dev step latency rises from 0.55 s to 0.9 s. Verify the GPU is detected:

python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"
# Expected: NVIDIA GeForce RTX 4090 (8, 9)

CUDA and PyTorch versions

| Component | Version | Why |
| --- | --- | --- |
| NVIDIA driver | 550+ | Ada FP8 tensor core support |
| CUDA toolkit | 12.4 | Best PyTorch wheels |
| PyTorch | 2.4.0+cu124 | SDPA flash attention on Ada |
| xformers | 0.0.27.post2 | Optional; mostly redundant with SDPA |
| Python | 3.11 | 3.10 also fine; 3.12 has wheel gaps |
| OS | Ubuntu 22.04 LTS | Best wheel coverage |
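
To confirm the running stack matches the table, two quick checks (the expected line assumes the pinned versions above; driver point releases will vary):

nvidia-smi --query-gpu=driver_version --format=csv,noheader
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.backends.cuda.flash_sdp_enabled())"
# Expected along the lines of: 2.4.0+cu124 12.4 True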

Essential custom nodes

Install ComfyUI-Manager first, then add the rest through its UI. Restart Comfy after installing nodes; some require Python module reloads.

cd custom_nodes
git clone https://github.com/ltdrdata/ComfyUI-Manager.git
cd .. && pkill -f "main.py" && python main.py --listen 0.0.0.0 --port 8188 --highvram &

| Node pack | Purpose | Why install |
| --- | --- | --- |
| ComfyUI-Manager | Plugin manager + missing-node detection | Mandatory for any non-trivial workflow |
| ComfyUI_essentials | Image ops, smart loaders | Replaces 20 other utility packs |
| ComfyUI-GGUF | FLUX GGUF quants | Q4 FLUX in 7 GB for ControlNet stacks |
| ComfyUI-AnimateDiff-Evolved | SD 1.5 video | Animation workflows |
| comfyui_controlnet_aux | ControlNet preprocessors | Canny, depth, openpose, normals |
| was-node-suite-comfyui | 200+ utility nodes | String ops, math, conditional flow |
| ComfyUI-Impact-Pack | Face/segment detailers | Production face fixes without inpaint round-trips |
| rgthree-comfy | Workflow QoL | Power Lora Loader, fast group muting |
| ComfyUI-Custom-Scripts | Autocomplete, image feed | UX improvements only, no model impact |

SDXL workflow

Drop sd_xl_base_1.0.safetensors into models/checkpoints. Default workflow: Load Checkpoint → KSampler at 28 steps, CFG 7, DPM++ 2M Karras, 1024×1024 → VAE Decode → Save Image. The 4090 produces a single image in ~2.0 s; batch 4 in 5.4 s. Add the SDXL Refiner as a second KSampler with denoise=0.25 over 8 steps for an extra 1.1 s of polish.

| SDXL workflow | Latency | VRAM peak |
| --- | --- | --- |
| Base only, batch 1 | 2.0 s | 10 GB |
| Base only, batch 4 | 5.4 s | 14 GB |
| Base + Refiner, batch 1 | 3.1 s | 14 GB |
| Base + Face Detailer | 5.5 s | 13 GB |
| Base + ControlNet Canny | 3.4 s | 14.5 GB |
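
Once the graph behaves in the browser you can drive it headlessly: enable the dev mode options in the ComfyUI settings, export the graph with Save (API Format), then POST it to the /prompt endpoint. A minimal sketch; sdxl_base_api.json is a hypothetical filename for your exported graph:

# Queue one job; the JSON response includes a prompt_id you can poll via /history/<prompt_id>
curl -s -X POST http://localhost:8188/prompt \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": $(cat sdxl_base_api.json)}"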

For the underlying SDXL benchmark sweep see the SDXL benchmark. For the Diffusers and A1111 alternatives see the Stable Diffusion setup.

FLUX.1 workflow

FLUX needs a different node arrangement: separate Load Diffusion Model, DualCLIPLoader (T5-XXL + CLIP-L) and Load VAE nodes, then a KSampler at 4 steps, Euler, CFG 1.0 for schnell. Memory budget on the 4090:

| Component | FP16 | FP8 e4m3 | GGUF Q4 |
| --- | --- | --- | --- |
| Diffusion model | 22 GB | 11.5 GB | 6.8 GB |
| T5-XXL encoder | 9.5 GB (offloaded) | 4.8 GB (resident) | 4.8 GB (resident) |
| CLIP-L | 0.25 GB | 0.25 GB | 0.25 GB |
| VAE | 0.16 GB | 0.16 GB | 0.16 GB |
| Total resident | 22.4 GB | 16.7 GB | 12 GB |
| 4090 free | ~1.5 GB | ~7 GB | ~12 GB |

FP8 lets you keep T5 resident, which is what gets you 1.4 s per schnell image instead of 1.8 s. For combined LoRA stacking, ControlNet and IPAdapter, GGUF Q4 leaves the most headroom; use it for heavy multi-conditioning workflows. The detailed FLUX walkthrough is in the FLUX.1 setup guide.
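
Whichever precision you choose, verify the budget empirically after the first generation, since extra custom nodes and batch sizes shift the peak:

# Watch resident VRAM while a FLUX job runs; the peak should land near the table above
watch -n 1 'nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader'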

Performance numbers

| Workflow | Latency | Images/hour | Images/day @ 95% util |
| --- | --- | --- | --- |
| SDXL base, 28 steps | 2.0 s | 1,800 | ~41,000 |
| SDXL base + refiner | 3.1 s | 1,160 | ~26,000 |
| SD 1.5, 25 steps, batch 4 | 2.4 s/4 | 6,000 | ~137,000 |
| FLUX.1-schnell FP16 | 1.8 s | 2,000 | ~46,000 |
| FLUX.1-schnell FP8 | 1.4 s | 2,570 | ~58,000 |
| FLUX.1-dev FP8, 20 steps | 7.5 s | 480 | ~11,000 |
| FLUX.1-dev FP16, 28 steps | 14 s | 257 | ~5,800 |
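
The daily column is just images/hour × 24 × 0.95, with the missing 5% being non-generating time. For SDXL base:

python -c "print(round(3600 / 2.0 * 24 * 0.95))"
# 41040, rounded to ~41,000 in the table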

Run as a service with monitoring

For production traffic, run Comfy under systemd behind a reverse proxy. Create /etc/systemd/system/comfyui.service:

[Unit]
Description=ComfyUI
After=network.target

[Service]
User=ubuntu
WorkingDirectory=/home/ubuntu/ComfyUI
Environment="PATH=/home/ubuntu/ComfyUI/venv/bin"
ExecStart=/home/ubuntu/ComfyUI/venv/bin/python main.py --listen 0.0.0.0 --port 8188 --highvram
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Enable with sudo systemctl enable --now comfyui. Point Caddy or nginx at port 8188 with TLS, gate access with a basic-auth header, and you have a self-hosted SDXL/FLUX endpoint that prints money compared to per-image API pricing. Add the DCGM exporter from the first-day checklist for Prometheus metrics on GPU utilisation, VRAM and temperature.
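
A minimal nginx vhost for that, written as a heredoc so it pastes straight into a shell. comfy.example.com and the htpasswd user are placeholders, and TLS certificates are assumed to come from certbot or your own CA; the Upgrade headers matter because the Comfy front end runs over a websocket:

sudo htpasswd -c /etc/nginx/.htpasswd comfyadmin   # htpasswd ships in apache2-utils
sudo tee /etc/nginx/sites-available/comfyui >/dev/null <<'EOF'
server {
    listen 443 ssl;
    server_name comfy.example.com;            # placeholder domain
    # ssl_certificate / ssl_certificate_key lines go here

    auth_basic "ComfyUI";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://127.0.0.1:8188;
        proxy_http_version 1.1;               # required for the websocket upgrade
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
EOF
sudo ln -s /etc/nginx/sites-available/comfyui /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx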

Production gotchas

  1. Custom node pip conflicts: every custom node ships its own requirements.txt and they regularly disagree on numpy, opencv or pillow versions. Pin a known-good combination once a quarter and resist updating individual nodes mid-cycle.
  2. Workflow JSON drift: ComfyUI exports embedded workflow JSON inside output PNGs. When a node author renames a parameter, old workflows silently fail with cryptic errors. Keep workflow JSONs in git, not just in image metadata.
  3. Queue blocking: ComfyUI processes prompts strictly serially. For multi-tenant traffic you need either multiple Comfy processes (different ports) or a front-end queue that batches requests; see concurrent users.
  4. VRAM not released after FLUX: switching from FLUX-dev FP16 back to SDXL within the same Comfy session can leave 8-10 GB unreleased. Restart between checkpoint families.
  5. Manager node compromise: third-party Manager listings have been used to ship malicious code. Pin specific git commits in custom_nodes rather than tracking main; see the pinning sketch after this list.
  6. Power throttling under sustained gen: SDXL batch 4 pulls a sustained ~440 W. Verify thermal performance before multi-day runs.
  7. Disk growth from outputs: at 46 SDXL/min that is 10 GB/hour of PNGs. Ship outputs to S3 nightly; review the storage notes in monthly hosting cost.
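
For gotcha 5, pinning looks like the following; the hash is hypothetical, so substitute a commit you have actually reviewed and tested, and node-pins.txt is just a suggested convention for recording what production runs:

cd ~/ComfyUI/custom_nodes/comfyui_controlnet_aux
git fetch origin
git checkout 1a2b3c4                             # hypothetical known-good commit
git rev-parse HEAD >> ~/ComfyUI/node-pins.txt    # record the pin for this deploy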

Verdict

ComfyUI on a dedicated 4090 is the most efficient open-source image generation runtime per pound. The --highvram flag plus PyTorch 2.4 SDPA gets you to within 5% of theoretical Ada throughput, the custom node ecosystem covers every production need, and 24 GB of GDDR6X means FLUX-dev runs without offload. Pin your stack, run as a systemd service, monitor with Prometheus and you have a £550/month image backend with no per-image API meter. Compare to lighter cards in the 5060 Ti ComfyUI writeup, or scale up via the RTX 5090.

ComfyUI on a dedicated 4090

Pre-installed CUDA 12.4 stack, ready to clone and run. UK dedicated hosting.

Order the RTX 4090 24GB

See also: FLUX.1 setup, Stable Diffusion setup, SDXL benchmark, FLUX schnell benchmark, first-day checklist, thermal performance, image generation studio.
