What You’ll Build
In 30 minutes, you will have a production image generation API serving Stable Diffusion XL over HTTP. Your endpoints accept text prompts and return high-resolution images in under four seconds, with support for img2img transformations, custom LoRA styles, ControlNet conditioning, and batch generation. Running on a dedicated GPU server, you pay nothing per image — whether you generate 100 or 100,000 per day.
Cloud image generation APIs charge $0.02-$0.08 per image and restrict content through opaque filters. At production volumes of 5,000 images daily, that is $100-$400 per day in API fees. Self-hosted SDXL delivers identical quality with full control over model weights, LoRA adapters, and generation parameters without external content policies limiting your creative output.
Architecture Overview
The API wraps SDXL behind a FastAPI service with GPU memory management. Requests arrive via REST endpoints for txt2img, img2img, and inpainting. A scheduler manages the diffusion pipeline, loading base models, LoRA weights, and ControlNet models into VRAM on demand. A refiner model runs optionally for enhanced detail in the final denoising steps.
The API layer supports both synchronous generation (wait for result) and async generation (submit job, poll for result). Batch endpoints accept multiple prompts and return a ZIP archive of generated images. A caching layer stores frequently used LoRA and ControlNet combinations in VRAM to avoid reload latency.
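The sync/async split described above can be sketched with a minimal in-memory job store. `JobStore` and its field names are illustrative, not part of any framework; a production deployment would back this with Redis or a database so jobs survive restarts:

```python
import threading
import uuid
from enum import Enum

class JobStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"

class JobStore:
    """Minimal thread-safe registry for async generation jobs (sketch only)."""

    def __init__(self):
        self._jobs = {}
        self._lock = threading.Lock()

    def submit(self, prompt):
        """Register a job and return its id for later polling."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"status": JobStatus.PENDING,
                                  "prompt": prompt, "result": None}
        return job_id

    def poll(self, job_id):
        """Return the job record, or None for an unknown id."""
        with self._lock:
            return self._jobs.get(job_id)

    def complete(self, job_id, result):
        """Attach the finished image bytes and mark the job done."""
        with self._lock:
            self._jobs[job_id]["status"] = JobStatus.DONE
            self._jobs[job_id]["result"] = result
```

A background worker would pop pending jobs, run the diffusion pipeline, and call `complete()`; the HTTP layer only ever touches `submit()` and `poll()`.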
GPU Requirements
| Resolution | Recommended GPU | Min. VRAM | Speed (30 steps) |
|---|---|---|---|
| 1024×1024 | RTX 5090 | 24 GB | ~3.5 seconds |
| 1024×1024 + refiner | RTX 6000 Pro | 40 GB | ~4.5 seconds |
| 2048×2048 / batch | RTX 6000 Pro 96 GB | 80 GB | ~8 seconds |
SDXL base requires roughly 7 GB of VRAM. Adding the refiner model, ControlNet, and a LoRA adapter still fits within 24 GB. Concurrent generation or higher resolutions benefit from larger VRAM. See our self-hosted model guide for multi-model deployment strategies.
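As a rough planning aid, these footprints can be encoded in a small budget check. Only the ~7 GB base figure comes from this guide; the other per-component numbers and the 4 GB activation headroom are assumptions, so measure on your own hardware before committing to a card:

```python
# Approximate fp16 weight footprints in GB. Only "base" is from this guide;
# the rest are rough assumptions for planning purposes.
FOOTPRINT_GB = {
    "base": 7.0,       # SDXL base UNet + text encoders + VAE
    "refiner": 6.0,    # assumed
    "controlnet": 2.5, # assumed
    "lora": 0.2,       # assumed, per adapter
}

def fits_in_vram(components, vram_gb, headroom_gb=4.0):
    """Check whether a component set plus activation headroom fits a card."""
    needed = sum(FOOTPRINT_GB[c] for c in components) + headroom_gb
    return needed <= vram_gb
```

Under these assumptions, base + refiner + ControlNet + one LoRA lands around 20 GB, consistent with the 24 GB recommendation in the table.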
Step-by-Step Build
Deploy the SDXL pipeline on your GPU server with FastAPI handling HTTP requests. Configure model loading and build the generation endpoints.
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from diffusers import StableDiffusionXLPipeline, AutoencoderKL
import torch, io

app = FastAPI()

# The fp16-fixed VAE avoids black/NaN outputs that the stock SDXL VAE
# can produce at half precision
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae, torch_dtype=torch.float16
).to("cuda")
pipe.enable_xformers_memory_efficient_attention()

@app.post("/v1/images/generations")
async def generate(prompt: str, negative_prompt: str = "",
                   width: int = 1024, height: int = 1024,
                   steps: int = 30, cfg_scale: float = 7.5,
                   seed: int = -1):
    # Seed only when the caller asks for reproducible output;
    # seed=-1 keeps generation random
    generator = torch.Generator("cuda")
    if seed >= 0:
        generator.manual_seed(seed)
    image = pipe(
        prompt=prompt, negative_prompt=negative_prompt,
        width=width, height=height,
        num_inference_steps=steps, guidance_scale=cfg_scale,
        generator=generator
    ).images[0]
    # Stream the PNG back without touching disk
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    buf.seek(0)
    return StreamingResponse(buf, media_type="image/png")
```
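On the client side, it is worth validating parameters before sending a request, since SDXL rejects dimensions that are not multiples of 8. `build_generation_request` is a hypothetical helper, not part of any SDK; note that because the handler above declares plain scalar arguments, FastAPI reads them as query parameters, so a caller would pass the resulting dict as `params=` with `requests.post`:

```python
def build_generation_request(prompt, negative_prompt="", width=1024, height=1024,
                             steps=30, cfg_scale=7.5, seed=-1):
    """Validate parameters client-side before hitting /v1/images/generations."""
    if width % 8 or height % 8:
        raise ValueError("SDXL dimensions must be multiples of 8")
    if not 1 <= steps <= 150:
        raise ValueError("steps out of range")
    return {"prompt": prompt, "negative_prompt": negative_prompt,
            "width": width, "height": height, "steps": steps,
            "cfg_scale": cfg_scale, "seed": seed}

# Hypothetical usage against a running server:
#   import requests
#   r = requests.post("http://gpu-host:8000/v1/images/generations",
#                     params=build_generation_request("a red bicycle", seed=42))
#   open("out.png", "wb").write(r.content)
```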
Add LoRA loading endpoints to swap styles without restarting the server. The OpenAI-compatible format for the images endpoint lets existing client code work with minimal changes. See production setup patterns for request queuing and concurrency management.
Custom Models and Styles
Load custom LoRA adapters at runtime to offer branded styles, product-specific aesthetics, or domain-specific generation. Store LoRA files on disk and expose an endpoint that loads them into the pipeline: `pipe.load_lora_weights("path/to/lora.safetensors")`. Multiple LoRAs can be composed with weight blending for combined effects.
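In recent diffusers versions (with PEFT installed), multi-LoRA blending works by loading each adapter under a name with `load_lora_weights(..., adapter_name=...)` and then mixing them via `pipe.set_adapters`. The sketch below assumes a local `loras/` directory and hypothetical style names:

```python
from pathlib import Path

LORA_DIR = Path("loras")  # hypothetical directory holding .safetensors adapters

def resolve_loras(styles):
    """Turn {"style_name": blend_weight} into (adapter_name, path, weight) triples."""
    triples = []
    for name, weight in styles.items():
        path = LORA_DIR / f"{name}.safetensors"
        triples.append((name, str(path), float(weight)))
    return triples

def apply_loras(pipe, styles):
    """Load each adapter once, then blend them with set_adapters (diffusers + PEFT)."""
    triples = resolve_loras(styles)
    for name, path, _ in triples:
        pipe.load_lora_weights(path, adapter_name=name)
    pipe.set_adapters([n for n, _, _ in triples],
                      adapter_weights=[w for _, _, w in triples])
```

A call like `apply_loras(pipe, {"watercolor": 0.8, "product_shot": 0.4})` would blend two styles; wrapping this in a POST endpoint gives the runtime style-swapping described above.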
For product photography, architecture visualisation, or marketing asset generation, fine-tune LoRAs on your own datasets using the same GPU during off-peak hours. Pair image generation with an LLM to auto-generate prompts from product descriptions, creating an end-to-end content pipeline.
Deploy Your Image Generation API
A self-hosted SDXL API eliminates per-image costs, removes content restrictions, and lets you deploy custom fine-tuned models. Serve internal creative teams or integrate into your product pipeline. Launch on GigaGPU dedicated GPU hosting and generate at scale. Browse more API use cases and tutorials in our library.