ComfyUI is the most efficient diffusion runtime for the RTX 4090 24GB. It exposes the entire pipeline as a node graph, lets you swap precisions on the fly, and uses less VRAM overhead than AUTOMATIC1111 or Forge. On Ada AD102 with 24 GB of GDDR6X you can hold FLUX.1-dev FP16 fully resident, batch SDXL four-up, and queue ControlNet workflows that would spill on a 16 GB card. This guide installs ComfyUI on a fresh GigaGPU RTX 4090 server and configures the custom nodes you actually need for SDXL and FLUX.1 production work on dedicated hosting.
Contents
- Base install
- CUDA and PyTorch versions
- Essential custom nodes
- SDXL workflow
- FLUX.1 workflow
- Performance numbers
- Run as a service with monitoring
- Production gotchas
Base install
```shell
sudo apt update && sudo apt install -y python3.11 python3.11-venv git
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
python3.11 -m venv venv && source venv/bin/activate
pip install --upgrade pip
pip install torch==2.4.0 torchvision --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
python main.py --listen 0.0.0.0 --port 8188 --highvram
```
The --highvram flag is mandatory on the 4090. It instructs Comfy to keep the UNet, VAE and text encoders resident, which removes the per-step CPU-to-GPU transfer cost. Without it, FLUX.1-dev step latency rises from 0.55 s to 0.9 s. Verify the GPU is detected:
```shell
python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"
# Expected: NVIDIA GeForce RTX 4090 (8, 9)
```
CUDA and PyTorch versions
| Component | Version | Why |
|---|---|---|
| NVIDIA driver | 550+ | Ada FP8 tensor core support |
| CUDA toolkit | 12.4 | Best PyTorch wheels |
| PyTorch | 2.4.0+cu124 | SDPA flash on Ada |
| xformers | 0.0.27.post2 | Optional, mostly redundant with SDPA |
| Python | 3.11 | 3.10 also fine; 3.12 has wheel gaps |
| OS | Ubuntu 22.04 LTS | Best wheel coverage |
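One way to hold this stack steady is a pip constraints file. The version pins below mirror the table; the torchvision pin is an assumption chosen to match the torch 2.4.0 release pairing.

```
# constraints.txt -- apply with: pip install -r requirements.txt -c constraints.txt
torch==2.4.0+cu124
torchvision==0.19.0+cu124
xformers==0.0.27.post2
```

Passing `-c constraints.txt` on every pip invocation stops a custom node's requirements.txt from silently upgrading torch underneath you.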
Essential custom nodes
Install ComfyUI-Manager first, then add the rest through its UI. Restart Comfy after installing nodes; some require Python module reloads.
```shell
cd custom_nodes
git clone https://github.com/ltdrdata/ComfyUI-Manager.git
cd .. && pkill -f "main.py" && python main.py --listen 0.0.0.0 --port 8188 --highvram &
```
| Node pack | Purpose | Why install |
|---|---|---|
| ComfyUI-Manager | Plugin manager + missing-node detection | Mandatory for any non-trivial workflow |
| ComfyUI_essentials | Image ops, smart loaders | Replaces 20 other utility packs |
| ComfyUI-GGUF | FLUX GGUF quants | Q4 FLUX in 7 GB for ControlNet stacks |
| ComfyUI-AnimateDiff-Evolved | SD 1.5 video | Animation workflows |
| comfyui_controlnet_aux | ControlNet preprocessors | Canny, depth, openpose, normals |
| was-node-suite-comfyui | 200+ utility nodes | String ops, math, conditional flow |
| ComfyUI-Impact-Pack | Face/segment detailers | Production face fix without inpaint round-trips |
| rgthree-comfy | Workflow QoL | Power Lora Loader, fast group muting |
| ComfyUI-Custom-Scripts | Autocomplete, image feed | UX improvements only, no model impact |
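Each node pack is a git checkout, so the whole set can be pinned by recording commit hashes. A minimal sketch, demonstrated on a throwaway directory so it is safe to run anywhere — point `CN_DIR` at `ComfyUI/custom_nodes` in production (the `demo-pack` repo and `node-pins.txt` filename are illustrative, not part of ComfyUI):

```shell
# Record the exact commit of every installed node pack for reproducible rebuilds.
CN_DIR="$(mktemp -d)"
git init -q "$CN_DIR/demo-pack"
git -C "$CN_DIR/demo-pack" -c user.email=ci@local -c user.name=ci \
    commit -q --allow-empty -m "pin demo"
for d in "$CN_DIR"/*/; do
    printf '%s %s\n' "$(basename "$d")" "$(git -C "$d" rev-parse HEAD)"
done > "$CN_DIR/node-pins.txt"
cat "$CN_DIR/node-pins.txt"
```

Restoring a known-good state is then `git -C <pack> checkout <hash>` for each recorded line.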
SDXL workflow
Drop sd_xl_base_1.0.safetensors into models/checkpoints. Default workflow: Load Checkpoint → KSampler at 28 steps, CFG 7, DPM++ 2M Karras, 1024×1024 → VAE Decode → Save Image. The 4090 produces a single image in ~2.0 s; batch 4 in 5.4 s. Add the SDXL Refiner as a second KSampler with denoise=0.25 over 8 steps for an extra 1.1 s of polish.
| SDXL workflow | Latency | VRAM peak |
|---|---|---|
| Base only, batch 1 | 2.0 s | 10 GB |
| Base only, batch 4 | 5.4 s | 14 GB |
| Base + Refiner, batch 1 | 3.1 s | 14 GB |
| Base + Face Detailer | 5.5 s | 13 GB |
| Base + ControlNet Canny | 3.4 s | 14.5 GB |
For the underlying SDXL benchmark sweep see the SDXL benchmark. For the Diffusers and A1111 alternatives see the Stable Diffusion setup.
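Once a workflow is dialled in, it can be queued headlessly through ComfyUI's HTTP API: export the graph with "Save (API Format)" and POST it to /prompt. The sketch below uses an empty placeholder graph so the payload shape is visible; a real workflow_api.json carries the full node graph.

```shell
# Placeholder export -- a real workflow_api.json comes from "Save (API Format)".
echo '{}' > workflow_api.json
PAYLOAD=$(printf '{"prompt": %s}' "$(cat workflow_api.json)")
echo "$PAYLOAD"    # {"prompt": {}}
# Against a running instance:
# curl -s -X POST http://localhost:8188/prompt \
#      -H 'Content-Type: application/json' -d "$PAYLOAD"
```

GET /history on the same port returns completed prompts and their output filenames.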
FLUX.1 workflow
FLUX needs a different node arrangement: a separate Load Diffusion Model, a DualCLIPLoader (T5-XXL + CLIP-L) and a Load VAE, then a KSampler at 4 steps, Euler, CFG 1.0 for schnell. Memory budget on the 4090:
| Component | FP16 | FP8 e4m3 | GGUF Q4 |
|---|---|---|---|
| Diffusion model | 22 GB | 11.5 GB | 6.8 GB |
| T5-XXL encoder | 9.5 GB (offloaded) | 4.8 GB (resident) | 4.8 GB (resident) |
| CLIP-L | 0.25 GB | 0.25 GB | 0.25 GB |
| VAE | 0.16 GB | 0.16 GB | 0.16 GB |
| Total resident | 22.4 GB | 16.7 GB | 12 GB |
| 4090 free | ~1.5 GB | ~7 GB | ~12 GB |
FP8 lets you keep T5 resident which is what gets you 1.4 s per schnell image instead of 1.8 s. For LoRA stacking, ControlNet and IPAdapter combined, GGUF Q4 leaves the most headroom; use it for heavy multi-conditioning workflows. The detailed FLUX walkthrough is in the FLUX.1 setup guide.
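Recent ComfyUI builds expose FP8 casting as launch flags, which avoids keeping a separate FP8 checkpoint on disk — the flag names below are from current builds and worth confirming against `python main.py --help`:

```shell
# Cast the diffusion model and T5 encoder to FP8 e4m3 at load time.
python main.py --listen 0.0.0.0 --port 8188 --highvram \
    --fp8_e4m3fn-unet --fp8_e4m3fn-text-enc
```

For GGUF Q4 you instead load a quantised .gguf file through ComfyUI-GGUF's loader node; no launch flag is involved.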
Performance numbers
| Workflow | Latency | Images/hour | Images/day @ 95% util |
|---|---|---|---|
| SDXL base 28 steps | 2.0 s | 1,800 | ~41,000 |
| SDXL base + refiner | 3.1 s | 1,160 | ~26,000 |
| SD 1.5 25 steps batch 4 | 2.4 s/4 | 6,000 | ~137,000 |
| FLUX.1-schnell FP16 | 1.8 s | 2,000 | ~46,000 |
| FLUX.1-schnell FP8 | 1.4 s | 2,570 | ~58,000 |
| FLUX.1-dev FP8 20 steps | 7.5 s | 480 | ~11,000 |
| FLUX.1-dev FP16 28 steps | 14 s | 257 | ~5,800 |
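The throughput columns are pure arithmetic: images/hour = 3600 / latency, and images/day at 95% utilisation = 86,400 × 0.95 / latency (×4 for the batch-4 row). A quick check of the single-image rows:

```shell
# Recompute the table's throughput columns from per-image latency.
for lat in 2.0 3.1 1.8 1.4 7.5 14; do
    awk -v l="$lat" 'BEGIN { printf "%.1fs -> %d/hr, %d/day\n", l, 3600/l, 86400*0.95/l }'
done
# First line: 2.0s -> 1800/hr, 41040/day
```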
Run as a service with monitoring
For production traffic, run Comfy under systemd behind a reverse proxy. Create /etc/systemd/system/comfyui.service:
```ini
[Unit]
Description=ComfyUI
After=network.target

[Service]
User=ubuntu
WorkingDirectory=/home/ubuntu/ComfyUI
Environment="PATH=/home/ubuntu/ComfyUI/venv/bin"
ExecStart=/home/ubuntu/ComfyUI/venv/bin/python main.py --listen 0.0.0.0 --port 8188 --highvram
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```
Enable with sudo systemctl enable --now comfyui. Point Caddy or nginx at port 8188 with TLS, gate access with a basic-auth header, and you have a self-hosted SDXL/FLUX endpoint that prints money compared to per-image API pricing. Add the DCGM exporter from the first-day checklist for Prometheus metrics on GPU utilisation, VRAM and temperature.
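A minimal Caddy front-end for this looks like the fragment below (Caddy v2.8+ syntax; the hostname is a placeholder and the password hash comes from `caddy hash-password`). Caddy proxies ComfyUI's websocket traffic without extra configuration:

```
comfy.example.com {
    basic_auth {
        # bcrypt hash generated with: caddy hash-password
        admin <bcrypt-hash>
    }
    reverse_proxy 127.0.0.1:8188
}
```

On Caddy versions before 2.8 the directive is spelled `basicauth`.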
Production gotchas
- Custom node pip conflicts: every custom node ships its own `requirements.txt`, and they regularly disagree on numpy, opencv or pillow versions. Pin a known-good combination once a quarter and resist updating individual nodes mid-cycle.
- Workflow JSON drift: ComfyUI embeds workflow JSON inside output PNGs. When a node author renames a parameter, old workflows silently fail with cryptic errors. Keep workflow JSONs in git, not just in image metadata.
- Queue blocking: ComfyUI processes prompts strictly serially. For multi-tenant traffic you need either multiple Comfy processes (different ports) or a front-end queue that batches requests; see concurrent users.
- VRAM not released after FLUX: switching from FLUX-dev FP16 back to SDXL within the same Comfy session can leave 8-10 GB unreleased. Restart between checkpoint families.
- Manager node compromise: third-party Manager listings have been used to ship malicious code. Pin specific git commits in `custom_nodes` rather than tracking `main`.
- Power throttling under sustained gen: SDXL batch 4 sustained pulls ~440 W. Verify thermal performance before multi-day runs.
- Disk growth from outputs: at 46 SDXL/min that is 10 GB/hour of PNGs. Ship outputs to S3 nightly; review the storage notes in monthly hosting cost.
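The last gotcha is easily automated with a nightly ship-then-prune job. A sketch, demonstrated on a temp directory so it is safe to run anywhere — point `OUTPUT_DIR` at `ComfyUI/output` in production, and note the S3 bucket name is hypothetical:

```shell
# Safe demo on a temp dir; in production set OUTPUT_DIR=~/ComfyUI/output.
OUTPUT_DIR="$(mktemp -d)"
touch -d '3 days ago' "$OUTPUT_DIR/old.png"
touch "$OUTPUT_DIR/new.png"
# Ship first (bucket is a placeholder), then prune anything already shipped:
# aws s3 sync "$OUTPUT_DIR" s3://example-bucket/comfyui-outputs/
find "$OUTPUT_DIR" -name '*.png' -mtime +2 -delete
ls "$OUTPUT_DIR"
```

Run it from cron after the sync so nothing is deleted before it lands in S3.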
Verdict
ComfyUI on a dedicated 4090 is the most efficient open-source image generation runtime per pound. The --highvram flag plus PyTorch 2.4 SDPA gets you to within 5% of theoretical Ada throughput, the custom node ecosystem covers every production need, and 24 GB of GDDR6X means FLUX-dev runs without offload. Pin your stack, run as a systemd service, monitor with Prometheus and you have a £550/month image backend with no per-image API meter. Compare to lighter cards in the 5060 Ti ComfyUI writeup or scale up via 4090 or 5090.
ComfyUI on a dedicated 4090
Pre-installed CUDA 12.4 stack, ready to clone and run. UK dedicated hosting.
Order the RTX 4090 24GB

See also: FLUX.1 setup, Stable Diffusion setup, SDXL benchmark, FLUX schnell benchmark, first-day checklist, thermal performance, image generation studio.