
ComfyUI on RTX 4090 24GB: Production Install, Custom Nodes and Workflows

Full production ComfyUI install for the RTX 4090 24GB with the custom nodes that make FLUX.1 and SDXL workflows production-ready, plus systemd, monitoring and gotchas.

ComfyUI is the most efficient diffusion runtime for the RTX 4090 24GB. It exposes the entire pipeline as a node graph, lets you swap precisions on the fly, and carries lower VRAM overhead than AUTOMATIC1111 or Forge. On Ada AD102 with 24 GB of GDDR6X you can hold FLUX.1-dev FP16 fully resident, batch SDXL four-up, and queue ControlNet workflows that would spill on a 16 GB card. This guide installs ComfyUI on a fresh GigaGPU RTX 4090 server and configures the custom nodes you actually need for SDXL and FLUX.1 production work on dedicated hosting.

Base install

sudo apt update && sudo apt install -y python3.11 python3.11-venv git
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
python3.11 -m venv venv && source venv/bin/activate
pip install --upgrade pip
pip install torch==2.4.0 torchvision --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
python main.py --listen 0.0.0.0 --port 8188 --highvram

The --highvram flag is mandatory on the 4090. It instructs Comfy to keep the UNet, VAE and text encoders resident, which removes the per-step CPU-to-GPU transfer cost. Without it, FLUX.1-dev step latency rises from 0.55 s to 0.9 s. Verify the GPU is detected:

python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"
# Expected: NVIDIA GeForce RTX 4090 (8, 9)

CUDA and PyTorch versions

| Component | Version | Why |
| --- | --- | --- |
| NVIDIA driver | 550+ | Ada FP8 tensor core support |
| CUDA toolkit | 12.4 | Best PyTorch wheels |
| PyTorch | 2.4.0+cu124 | SDPA flash attention on Ada |
| xformers | 0.0.27.post2 | Optional; mostly redundant with SDPA |
| Python | 3.11 | 3.10 also fine; 3.12 has wheel gaps |
| OS | Ubuntu 22.04 LTS | Best wheel coverage |
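
To confirm the running stack matches the table, two quick checks (the expected line assumes the pinned versions above; driver point releases will vary):

nvidia-smi --query-gpu=driver_version --format=csv,noheader
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.backends.cuda.flash_sdp_enabled())"
# Expected along the lines of: 2.4.0+cu124 12.4 True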

Essential custom nodes

Install ComfyUI-Manager first, then add the rest through its UI. Restart Comfy after installing nodes; some require Python module reloads.

cd custom_nodes
git clone https://github.com/ltdrdata/ComfyUI-Manager.git
cd .. && pkill -f "main.py" && python main.py --listen 0.0.0.0 --port 8188 --highvram &

| Node pack | Purpose | Why install |
| --- | --- | --- |
| ComfyUI-Manager | Plugin manager + missing-node detection | Mandatory for any non-trivial workflow |
| ComfyUI_essentials | Image ops, smart loaders | Replaces 20 other utility packs |
| ComfyUI-GGUF | FLUX GGUF quants | Q4 FLUX in 7 GB for ControlNet stacks |
| ComfyUI-AnimateDiff-Evolved | SD 1.5 video | Animation workflows |
| comfyui_controlnet_aux | ControlNet preprocessors | Canny, depth, openpose, normals |
| was-node-suite-comfyui | 200+ utility nodes | String ops, math, conditional flow |
| ComfyUI-Impact-Pack | Face/segment detailers | Production face fixes without inpaint round-trips |
| rgthree-comfy | Workflow QoL | Power Lora Loader, fast group muting |
| ComfyUI-Custom-Scripts | Autocomplete, image feed | UX improvements only, no model impact |

SDXL workflow

Drop sd_xl_base_1.0.safetensors into models/checkpoints. Default workflow: Load Checkpoint → KSampler at 28 steps, CFG 7, DPM++ 2M Karras, 1024×1024 → VAE Decode → Save Image. The 4090 produces a single image in ~2.0 s; batch 4 in 5.4 s. Add the SDXL Refiner as a second KSampler with denoise=0.25 over 8 steps for an extra 1.1 s of polish.

| SDXL workflow | Latency | VRAM peak |
| --- | --- | --- |
| Base only, batch 1 | 2.0 s | 10 GB |
| Base only, batch 4 | 5.4 s | 14 GB |
| Base + Refiner, batch 1 | 3.1 s | 14 GB |
| Base + Face Detailer | 5.5 s | 13 GB |
| Base + ControlNet Canny | 3.4 s | 14.5 GB |
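
Once the graph behaves in the browser you can drive it headlessly: enable the dev mode options in the ComfyUI settings, export the graph with Save (API Format), then POST it to the /prompt endpoint. A minimal sketch; sdxl_base_api.json is a hypothetical filename for your exported graph:

# Queue one job; the JSON response includes a prompt_id you can poll via /history/<prompt_id>
curl -s -X POST http://localhost:8188/prompt \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": $(cat sdxl_base_api.json)}"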

For the underlying SDXL benchmark sweep see the SDXL benchmark. For the Diffusers and A1111 alternatives see the Stable Diffusion setup.

FLUX.1 workflow

FLUX needs a different node arrangement: separate Load Diffusion Model, DualCLIPLoader (T5-XXL + CLIP-L) and Load VAE nodes, then a KSampler at 4 steps, Euler, CFG 1.0 for schnell. Memory budget on the 4090:

| Component | FP16 | FP8 e4m3 | GGUF Q4 |
| --- | --- | --- | --- |
| Diffusion model | 22 GB | 11.5 GB | 6.8 GB |
| T5-XXL encoder | 9.5 GB (offloaded) | 4.8 GB (resident) | 4.8 GB (resident) |
| CLIP-L | 0.25 GB | 0.25 GB | 0.25 GB |
| VAE | 0.16 GB | 0.16 GB | 0.16 GB |
| Total resident | 22.4 GB | 16.7 GB | 12 GB |
| 4090 free | ~1.5 GB | ~7 GB | ~12 GB |

FP8 lets you keep T5 resident, which is what gets you 1.4 s per schnell image instead of 1.8 s. For combined LoRA stacking, ControlNet and IPAdapter, GGUF Q4 leaves the most headroom; use it for heavy multi-conditioning workflows. The detailed FLUX walkthrough is in the FLUX.1 setup guide.
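
Whichever precision you choose, verify the budget empirically after the first generation, since extra custom nodes and batch sizes shift the peak:

# Watch resident VRAM while a FLUX job runs; the peak should land near the table above
watch -n 1 'nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader'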

Performance numbers

| Workflow | Latency | Images/hour | Images/day @ 95% util |
| --- | --- | --- | --- |
| SDXL base, 28 steps | 2.0 s | 1,800 | ~41,000 |
| SDXL base + refiner | 3.1 s | 1,160 | ~26,000 |
| SD 1.5, 25 steps, batch 4 | 2.4 s/4 | 6,000 | ~137,000 |
| FLUX.1-schnell FP16 | 1.8 s | 2,000 | ~46,000 |
| FLUX.1-schnell FP8 | 1.4 s | 2,570 | ~58,000 |
| FLUX.1-dev FP8, 20 steps | 7.5 s | 480 | ~11,000 |
| FLUX.1-dev FP16, 28 steps | 14 s | 257 | ~5,800 |
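
The daily column is just images/hour × 24 × 0.95, with the missing 5% being non-generating time. For SDXL base:

python -c "print(round(3600 / 2.0 * 24 * 0.95))"
# 41040, rounded to ~41,000 in the table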

Run as a service with monitoring

For production traffic, run Comfy under systemd behind a reverse proxy. Create /etc/systemd/system/comfyui.service:

[Unit]
Description=ComfyUI
After=network.target

[Service]
User=ubuntu
WorkingDirectory=/home/ubuntu/ComfyUI
Environment="PATH=/home/ubuntu/ComfyUI/venv/bin"
ExecStart=/home/ubuntu/ComfyUI/venv/bin/python main.py --listen 0.0.0.0 --port 8188 --highvram
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Enable with sudo systemctl enable --now comfyui. Point Caddy or nginx at port 8188 with TLS, gate access with a basic-auth header, and you have a self-hosted SDXL/FLUX endpoint that prints money compared to per-image API pricing. Add the DCGM exporter from the first-day checklist for Prometheus metrics on GPU utilisation, VRAM and temperature.
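
A minimal nginx vhost for that, written as a heredoc so it pastes straight into a shell. comfy.example.com and the htpasswd user are placeholders, and TLS certificates are assumed to come from certbot or your own CA; the Upgrade headers matter because the Comfy front end runs over a websocket:

sudo htpasswd -c /etc/nginx/.htpasswd comfyadmin   # htpasswd ships in apache2-utils
sudo tee /etc/nginx/sites-available/comfyui >/dev/null <<'EOF'
server {
    listen 443 ssl;
    server_name comfy.example.com;            # placeholder domain
    # ssl_certificate / ssl_certificate_key lines go here

    auth_basic "ComfyUI";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://127.0.0.1:8188;
        proxy_http_version 1.1;               # required for the websocket upgrade
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
EOF
sudo ln -s /etc/nginx/sites-available/comfyui /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx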

Production gotchas

  1. Custom node pip conflicts: every custom node ships its own requirements.txt and they regularly disagree on numpy, opencv or pillow versions. Pin a known-good combination once a quarter and resist updating individual nodes mid-cycle.
  2. Workflow JSON drift: ComfyUI exports embedded workflow JSON inside output PNGs. When a node author renames a parameter, old workflows silently fail with cryptic errors. Keep workflow JSONs in git, not just in image metadata.
  3. Queue blocking: ComfyUI processes prompts strictly serially. For multi-tenant traffic you need either multiple Comfy processes (different ports) or a front-end queue that batches requests; see concurrent users.
  4. VRAM not released after FLUX: switching from FLUX-dev FP16 back to SDXL within the same Comfy session can leave 8-10 GB unreleased. Restart between checkpoint families.
  5. Manager node compromise: third-party Manager listings have been used to ship malicious code. Pin specific git commits in custom_nodes rather than tracking main; see the pinning sketch after this list.
  6. Power throttling under sustained gen: SDXL batch 4 pulls a sustained ~440 W. Verify thermal performance before multi-day runs.
  7. Disk growth from outputs: at 46 SDXL/min that is 10 GB/hour of PNGs. Ship outputs to S3 nightly; review the storage notes in monthly hosting cost.
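
For gotcha 5, pinning looks like the following; the hash is hypothetical, so substitute a commit you have actually reviewed and tested, and node-pins.txt is just a suggested convention for recording what production runs:

cd ~/ComfyUI/custom_nodes/comfyui_controlnet_aux
git fetch origin
git checkout 1a2b3c4                             # hypothetical known-good commit
git rev-parse HEAD >> ~/ComfyUI/node-pins.txt    # record the pin for this deploy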

Verdict

ComfyUI on a dedicated 4090 is the most efficient open-source image generation runtime per pound. The --highvram flag plus PyTorch 2.4 SDPA gets you to within 5% of theoretical Ada throughput, the custom node ecosystem covers every production need, and 24 GB of GDDR6X means FLUX-dev runs without offload. Pin your stack, run as a systemd service, monitor with Prometheus and you have a £550/month image backend with no per-image API meter. Compare to lighter cards in the 5060 Ti ComfyUI writeup, or scale up via the RTX 5090.

ComfyUI on a dedicated 4090

Pre-installed CUDA 12.4 stack, ready to clone and run. UK dedicated hosting.

Order the RTX 4090 24GB

See also: FLUX.1 setup, Stable Diffusion setup, SDXL benchmark, FLUX schnell benchmark, first-day checklist, thermal performance, image generation studio.
