Why Run SDXL on Dedicated Hardware
Stable Diffusion XL produces stunning 1024×1024 images with superior text rendering, composition, and detail compared with earlier versions. Running SDXL on a dedicated GPU server unlocks unlimited image generation without per-image API costs, complete privacy for generated content, and the performance to serve real-time generation for production applications. GigaGPU offers pre-configured Stable Diffusion hosting and image generator hosting for turnkey deployment.
For businesses generating product images, marketing assets, game textures, or creative content at volume, the economics of self-hosting are compelling. A single RTX 5090 generates an SDXL image in 5-8 seconds, meaning one server can produce 400-700 images per hour continuously. For detailed deployment patterns, see our guide to deploying a Stable Diffusion server.
GPU Requirements for Stable Diffusion XL
SDXL is more VRAM-hungry than SD 1.5, but modern GPUs handle it well. The base model requires approximately 7 GB of VRAM, with additional memory needed for the refiner, LoRA adapters, and batch processing.
| GPU | VRAM | SDXL Base (1024×1024) | SDXL + Refiner | Batch of 4 |
|---|---|---|---|---|
| RTX 3090 | 24 GB | ~5 sec | ~9 sec | ~15 sec |
| RTX 5090 | 32 GB | ~4 sec | ~7 sec | ~12 sec |
| RTX 5080 | 16 GB | ~7 sec | ~12 sec | ~20 sec |
| RTX 6000 Ada | 48 GB | ~5 sec | ~9 sec | ~14 sec |
| RTX 6000 Pro | 96 GB | ~3 sec | ~6 sec | ~8 sec |
The RTX 3090 vs RTX 5090 choice often comes down to price: the 3090 is more cost-effective, while the 5090 is faster per image. For commercial generation at scale, the RTX 6000 Pro delivers the best throughput per dollar for sustained workloads. Our cheapest GPU for AI inference guide covers the full pricing breakdown. If you are comparing the cost of self-hosted image generation against API services, our GPU vs API cost comparison tool can help model the economics.
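To make the economics concrete, here is a back-of-the-envelope cost-per-image calculation. The monthly price and utilization figures below are placeholder assumptions for illustration, not GigaGPU quotes; the per-image latency is the mid-range of the 5-8 second figure above.

```python
# Rough cost-per-image model for a dedicated SDXL server.
# All inputs are illustrative assumptions, not real pricing.
monthly_cost = 900.0        # hypothetical server price, USD/month
seconds_per_image = 6.0     # mid-range single-image latency from above
utilization = 0.70          # fraction of the month spent generating

seconds_per_month = 30 * 24 * 3600
images_per_month = seconds_per_month * utilization / seconds_per_image
cost_per_image = monthly_cost / images_per_month

print(f"{images_per_month:,.0f} images/month")
print(f"${cost_per_image:.4f} per image")
```

Under these assumptions the server produces roughly 300,000 images a month at a fraction of a cent each, which is the comparison to run against any per-image API price.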
Installing SDXL with Diffusers
The Hugging Face Diffusers library provides a clean Python API for SDXL inference:
```bash
# Create environment
python3 -m venv ~/sdxl-env
source ~/sdxl-env/bin/activate

# Install dependencies
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install diffusers transformers accelerate safetensors

# Generate an image
python3 << 'PYEOF'
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
).to("cuda")

# Enable memory optimization
pipe.enable_vae_slicing()

image = pipe(
    prompt="A professional product photo of a sleek laptop on a minimalist desk, soft studio lighting, 8k",
    negative_prompt="blurry, low quality, distorted",
    num_inference_steps=30,
    guidance_scale=7.5,
    width=1024,
    height=1024
).images[0]

image.save("output.png")
print("Image generated successfully")
PYEOF
```
The first run downloads model weights (~7 GB). GigaGPU servers with NVMe storage load these weights in seconds on subsequent runs. For the underlying PyTorch setup, our PyTorch GPU server installation guide covers the full environment configuration.
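To avoid the first API request stalling on that download, you can relocate the cache to fast storage and pre-fetch the weights. The path below is an example, and the `huggingface-cli download` invocation assumes a reasonably recent `huggingface_hub` release:

```bash
# Point the Hugging Face cache at NVMe storage (example path)
export HF_HOME=/data/huggingface

# Pre-download the SDXL base weights (~7 GB) before serving traffic
huggingface-cli download stabilityai/stable-diffusion-xl-base-1.0
```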
Deploying ComfyUI for Visual Workflows
ComfyUI provides a node-based interface for building complex generation workflows. It is ideal for teams that need fine-grained control over the generation pipeline. GigaGPU offers dedicated ComfyUI hosting with pre-installed models and extensions.
```bash
# Clone and set up ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git ~/ComfyUI
cd ~/ComfyUI

# Install dependencies (inside your Python environment)
pip install -r requirements.txt

# Download the SDXL checkpoint into the models directory
cd models/checkpoints
wget https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors

# Start ComfyUI
cd ~/ComfyUI
python main.py --listen 0.0.0.0 --port 8188
```
Access ComfyUI at http://YOUR_SERVER_IP:8188. The visual workflow editor lets you chain models, LoRA adapters, upscalers, and post-processing nodes without writing code.
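ComfyUI ships with no built-in authentication, so for anything beyond a quick test consider binding it to localhost and reaching it over an SSH tunnel instead of exposing the port publicly. The username and host below are placeholders:

```bash
# On the server: bind ComfyUI to localhost only
python main.py --listen 127.0.0.1 --port 8188

# On your workstation: forward local port 8188 to the server,
# then browse to http://localhost:8188
ssh -N -L 8188:localhost:8188 deploy@YOUR_SERVER_IP
```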
Building an Image Generation API
For programmatic access, wrap SDXL in a FastAPI endpoint:
```python
# sdxl_server.py
from fastapi import FastAPI
from fastapi.responses import Response
from pydantic import BaseModel
from diffusers import StableDiffusionXLPipeline
import torch
import io

app = FastAPI(title="SDXL Image Generation API")

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")
pipe.enable_vae_slicing()

class GenerateRequest(BaseModel):
    prompt: str
    negative_prompt: str = "blurry, low quality, distorted"
    steps: int = 30
    guidance_scale: float = 7.5
    width: int = 1024
    height: int = 1024
    seed: int = -1

@app.post("/generate")
async def generate(req: GenerateRequest):
    generator = None
    if req.seed >= 0:
        generator = torch.Generator("cuda").manual_seed(req.seed)
    image = pipe(
        prompt=req.prompt,
        negative_prompt=req.negative_prompt,
        num_inference_steps=req.steps,
        guidance_scale=req.guidance_scale,
        width=req.width,
        height=req.height,
        generator=generator
    ).images[0]
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return Response(content=buf.getvalue(), media_type="image/png")
```
```bash
# Run the API
uvicorn sdxl_server:app --host 0.0.0.0 --port 8000 --workers 1

# Test with curl
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "A mountain landscape at sunset, oil painting style"}' \
  --output generated.png
```
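The endpoint is just as easy to call from Python. This client sketch assumes the third-party `requests` package and the server address above; the function names are illustrative, not part of the API:

```python
import requests

API_URL = "http://localhost:8000/generate"  # adjust to your server

def build_payload(prompt, **overrides):
    """Assemble the JSON body, overriding server-side defaults as needed."""
    payload = {"prompt": prompt}
    payload.update(overrides)
    return payload

def generate_image(prompt, out_path="generated.png", **overrides):
    """POST to the API and write the returned PNG to disk."""
    resp = requests.post(API_URL, json=build_payload(prompt, **overrides),
                         timeout=300)
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)
    return out_path
```

Any field from the `GenerateRequest` model (steps, seed, dimensions) can be passed as a keyword override, e.g. `generate_image("A red barn", steps=25, seed=42)`.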
Optimization and Batch Generation
Maximize throughput with these optimizations:
```python
# Enable torch.compile for a 20-30% speedup (requires PyTorch 2.0+)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# Batch generation for multiple images
images = pipe(
    prompt=["A red sports car", "A blue mountain lake", "A golden retriever"],
    negative_prompt=["blurry"] * 3,
    num_inference_steps=25,
    guidance_scale=7.0,
    width=1024,
    height=1024
).images

for i, img in enumerate(images):
    img.save(f"batch_{i}.png")
```
Additional optimization techniques:
- Reduce inference steps: 25 steps often matches 30-step quality with DPM++ scheduler
- Use FP16: Half-precision reduces VRAM usage and increases speed
- VAE slicing: Processes the VAE in slices to reduce peak VRAM
- torch.compile: Compiles the UNet for optimised GPU kernels
- LoRA adapters: Smaller than full fine-tuned checkpoints, fast to swap
For teams running SDXL alongside other models, a single private GPU server can host image generation, an LLM for prompt enhancement, and vision models for quality assessment in a unified pipeline.
Production Configuration
Deploy SDXL as a systemd service with persistent model caching:
```ini
# /etc/systemd/system/sdxl.service
[Unit]
Description=SDXL Image Generation API
After=network.target

[Service]
User=deploy
WorkingDirectory=/home/deploy
ExecStart=/home/deploy/sdxl-env/bin/uvicorn sdxl_server:app \
    --host 0.0.0.0 --port 8000 --workers 1
Restart=always
RestartSec=10
Environment=HF_HOME=/data/huggingface
Environment=CUDA_VISIBLE_DEVICES=0

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl daemon-reload
sudo systemctl enable sdxl
sudo systemctl start sdxl
```
Place Nginx in front for TLS and rate limiting, following the patterns in our production inference server guide. Monitor GPU utilization and generation queue depth to ensure consistent response times. For teams scaling beyond a single GPU, see our guide to multi-GPU server setup for load balancing image generation across multiple GPUs. Check the model guides section for deployment guides covering additional image and vision models.
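A minimal Nginx front end for the API might look like the sketch below. The server name, certificate paths, and rate limits are placeholders to adapt to your deployment:

```nginx
# /etc/nginx/sites-available/sdxl (illustrative values throughout)
limit_req_zone $binary_remote_addr zone=sdxl:10m rate=10r/m;

server {
    listen 443 ssl;
    server_name images.example.com;

    ssl_certificate     /etc/letsencrypt/live/images.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/images.example.com/privkey.pem;

    location /generate {
        limit_req zone=sdxl burst=5;
        proxy_pass http://127.0.0.1:8000;
        proxy_read_timeout 300s;  # SDXL generations can take many seconds
    }
}
```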
Run Stable Diffusion XL on Dedicated Hardware
GigaGPU provides GPU servers optimised for image generation, pre-configured with CUDA, fast NVMe storage, and the VRAM you need for SDXL at full resolution. Generate unlimited images with zero per-image costs.
Browse GPU Servers