You will build a REST API that generates images from text prompts using SDXL, serves them via FastAPI with request queuing, and handles concurrent users without running out of VRAM. The end result: POST a JSON prompt, receive a high-quality 1024×1024 image in 4-8 seconds. Your marketing team, product designers, and content creators all hit the same endpoint — rate-limited per API key, running entirely on your dedicated GPU server.
API Architecture
| Component | Tool | Role | Resource |
|---|---|---|---|
| Image generation | SDXL 1.0 + Refiner | Text-to-image synthesis | ~12GB VRAM |
| API framework | FastAPI | HTTP endpoint, validation | CPU |
| Queue | asyncio.Queue | Serialise GPU requests | CPU |
| Cache | Redis | Deduplicate identical prompts | RAM |
Model Loading and Setup
```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

# Load SDXL base + refiner in fp16
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16"
).to("cuda")

refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# On a low-VRAM GPU, call base.enable_model_cpu_offload() INSTEAD of .to("cuda");
# combining the two defeats the offloading.

# Compile the UNet for speed (first run is slow, subsequent runs faster)
base.unet = torch.compile(base.unet, mode="reduce-overhead")
```
The SDXL pipeline uses approximately 12GB VRAM with both base and refiner loaded. On a 24GB GPU, you have room for LoRA adapters and additional models.
FastAPI Endpoint
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import asyncio, hashlib, io, base64

app = FastAPI()

class ImageRequest(BaseModel):
    prompt: str
    negative_prompt: str = "blurry, low quality, distorted"
    width: int = 1024
    height: int = 1024
    steps: int = 30
    use_refiner: bool = True

gpu_lock = asyncio.Lock()

def run_generation(req: ImageRequest):
    # Blocking diffusers calls live here, outside the event loop
    image = base(
        prompt=req.prompt,
        negative_prompt=req.negative_prompt,
        width=req.width, height=req.height,
        num_inference_steps=req.steps,
        output_type="latent" if req.use_refiner else "pil"
    ).images[0]
    if req.use_refiner:
        image = refiner(
            prompt=req.prompt, image=image[None, :],
            num_inference_steps=10
        ).images[0]
    return image

@app.post("/generate")
async def generate_image(req: ImageRequest):
    if len(req.prompt) > 500:
        raise HTTPException(400, "Prompt exceeds 500 characters")
    async with gpu_lock:
        # Run GPU work in a thread so the event loop keeps serving requests
        image = await asyncio.to_thread(run_generation, req)
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return {"image": base64.b64encode(buf.getvalue()).decode()}
```
The gpu_lock ensures only one generation runs at a time, preventing VRAM contention. Concurrent requests queue automatically.
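Calling the endpoint is a plain JSON POST, and the response carries the PNG as base64. A minimal client sketch (the localhost URL and the use of the requests library are assumptions about your deployment):

```python
import base64

def build_request(prompt: str, steps: int = 30, use_refiner: bool = True) -> dict:
    # Mirrors the ImageRequest model; omitted fields fall back to server defaults
    return {"prompt": prompt, "steps": steps, "use_refiner": use_refiner}

def decode_response(payload: dict) -> bytes:
    # The endpoint returns {"image": "<base64 PNG>"}; recover the raw bytes
    return base64.b64decode(payload["image"])

# Against a running server, e.g.:
#   resp = requests.post("http://localhost:8000/generate",
#                        json=build_request("a red fox, studio lighting"))
#   with open("out.png", "wb") as f:
#       f.write(decode_response(resp.json()))
```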
Request Queuing and Priorities
For production use with multiple teams, implement a priority queue that gives premium API keys faster processing. Add request timeout handling so users do not wait indefinitely when the queue is deep. Return a job ID immediately and let clients poll for completion — this prevents HTTP timeouts on slow generations. Integrate with ComfyUI for complex multi-step workflows that need more control than the API provides.
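The job-ID pattern can be sketched with an `asyncio.PriorityQueue` and a single background worker. This is a minimal sketch, not a full implementation: the `jobs` dict, priority values, and `generate` callable are assumptions, and a real deployment would add timeouts and persistence.

```python
import asyncio, itertools, uuid

jobs: dict = {}                 # job_id -> {"status": ..., "result": ...}
_seq = itertools.count()        # tie-breaker so the queue never compares payloads

async def submit(queue: asyncio.PriorityQueue, req, priority: int = 10) -> str:
    """Enqueue a request and return a job ID immediately (lower priority = sooner)."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "result": None}
    await queue.put((priority, next(_seq), job_id, req))
    return job_id

async def worker(queue: asyncio.PriorityQueue, generate) -> None:
    """Drain jobs one at a time so the GPU is never oversubscribed."""
    while True:
        _, _, job_id, req = await queue.get()
        jobs[job_id]["status"] = "running"
        try:
            # Blocking generation runs in a thread; the loop stays free to serve polls
            jobs[job_id]["result"] = await asyncio.to_thread(generate, req)
            jobs[job_id]["status"] = "done"
        except Exception as exc:
            jobs[job_id].update(status="failed", error=str(exc))
        finally:
            queue.task_done()
```

Clients would POST to get a `job_id` back, then poll a hypothetical `GET /jobs/{job_id}` route that returns the entry from `jobs`; premium API keys map to a lower priority number.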
Prompt Caching
```python
import hashlib, json
import redis

r = redis.Redis()

def get_cached_or_generate(req: ImageRequest):
    # Key on the full request so different sizes/steps cache separately
    cache_key = hashlib.sha256(
        json.dumps(req.dict(), sort_keys=True).encode()
    ).hexdigest()
    cached = r.get(f"img:{cache_key}")
    if cached:
        return cached  # Cached PNG bytes
    # Generate and cache for 24 hours
    image_bytes = generate_image_sync(req)
    r.setex(f"img:{cache_key}", 86400, image_bytes)
    return image_bytes
```
Caching identical prompts saves GPU cycles. Marketing teams often regenerate the same prompt while iterating on surrounding copy.
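The cache key only works if it is deterministic, which is what `sort_keys=True` buys: field order in the request must not change the hash. A quick check, with plain dicts standing in for the request model:

```python
import hashlib, json

def cache_key(params: dict) -> str:
    # sort_keys makes JSON field order irrelevant to the resulting hash
    return hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()

a = cache_key({"prompt": "a red fox", "steps": 30})
b = cache_key({"steps": 30, "prompt": "a red fox"})
assert a == b  # same request, same key, regardless of field order
```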
Production Hardening
Before exposing the endpoint beyond a trusted network:

- Add input validation to reject prompts that could generate harmful content, and run NSFW detection on outputs before returning them.
- Set maximum resolution limits to prevent VRAM exhaustion (SDXL at 2048×2048 uses significantly more VRAM than 1024×1024).
- Monitor GPU temperature and throttle requests if the server runs hot.
- Log all prompts and generation parameters for audit.
- Deploy behind authentication, as covered in private hosting guides.

See model hosting options for complementary text generation, review creative use cases, and explore more tutorials for advanced generation workflows; infrastructure guides cover scaling.
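The resolution cap is cheap to enforce before the GPU lock is ever taken. A minimal guard, where the multiple-of-8 rule comes from SDXL's VAE and the other bounds are illustrative values to tune for your GPU:

```python
MAX_PIXELS = 1024 * 1024  # same budget as the default 1024x1024 request

def validate_dims(width: int, height: int) -> None:
    # SDXL's VAE requires dimensions divisible by 8
    if width % 8 or height % 8:
        raise ValueError("width and height must be multiples of 8")
    # Per-side bounds are illustrative; adjust to your hardware
    if not (256 <= width <= 2048 and 256 <= height <= 2048):
        raise ValueError("dimensions must be between 256 and 2048")
    # Total pixel count is what actually drives VRAM use
    if width * height > MAX_PIXELS:
        raise ValueError("requested resolution exceeds the VRAM budget")
```

Call this inside the endpoint (or as a Pydantic validator) so oversized requests fail with a 400 instead of an out-of-memory error mid-generation.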