
SDXL Image Generation API with FastAPI

Build a production image generation API serving Stable Diffusion XL through FastAPI with queuing, caching, and GPU memory management on a dedicated server.

You will build a REST API that generates images from text prompts using SDXL, serves them via FastAPI with request queuing, and handles concurrent users without running out of VRAM. The end result: POST a JSON prompt, receive a high-quality 1024×1024 image in 4-8 seconds. Your marketing team, product designers, and content creators all hit the same endpoint — rate-limited per API key, running entirely on your dedicated GPU server.

API Architecture

| Component | Tool | Role | Resource |
| --- | --- | --- | --- |
| Image generation | SDXL 1.0 + Refiner | Text-to-image synthesis | ~12GB VRAM |
| API framework | FastAPI | HTTP endpoint, validation | CPU |
| Queue | asyncio.Queue | Serialise GPU requests | CPU |
| Cache | Redis | Deduplicate identical prompts | RAM |

Model Loading and Setup

import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

# Load SDXL base + refiner in fp16
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16"
).to("cuda")
# On smaller GPUs, call base.enable_model_cpu_offload() INSTEAD of .to("cuda") --
# do not combine the two, as offload manages device placement itself.

refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# Compile the UNet for speed (first run is slow, subsequent runs faster)
base.unet = torch.compile(base.unet, mode="reduce-overhead")

The SDXL pipeline uses approximately 12GB VRAM with both base and refiner loaded. On a 24GB GPU, you have room for LoRA adapters and additional models.

FastAPI Endpoint

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import asyncio, hashlib, io, base64

app = FastAPI()

class ImageRequest(BaseModel):
    prompt: str
    negative_prompt: str = "blurry, low quality, distorted"
    width: int = 1024
    height: int = 1024
    steps: int = 30
    use_refiner: bool = True

gpu_lock = asyncio.Lock()

@app.post("/generate")
async def generate_image(req: ImageRequest):
    if len(req.prompt) > 500:
        raise HTTPException(400, "Prompt exceeds 500 characters")

    def run_pipeline():
        # Base pass: output latents when the refiner will finish the image
        out = base(
            prompt=req.prompt,
            negative_prompt=req.negative_prompt,
            width=req.width, height=req.height,
            num_inference_steps=req.steps,
            output_type="latent" if req.use_refiner else "pil",
        ).images
        if req.use_refiner:
            # The refiner consumes the batched latents directly
            return refiner(
                prompt=req.prompt, image=out,
                num_inference_steps=10,
            ).images[0]
        return out[0]

    async with gpu_lock:
        # Run the blocking pipeline in a worker thread so the event loop stays responsive
        image = await asyncio.to_thread(run_pipeline)

    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return {"image": base64.b64encode(buf.getvalue()).decode()}

The gpu_lock ensures only one generation runs at a time, preventing VRAM contention. Concurrent requests queue automatically.
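The intro promises per-API-key rate limiting; one minimal way to sketch that is an in-memory token bucket checked before a request joins the queue. The TokenBucket class and its parameters below are illustrative assumptions, not part of the SDXL stack:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-key token bucket: `rate` tokens refilled per second, up to `capacity`."""

    def __init__(self, rate: float = 0.2, capacity: int = 5):
        self.rate, self.capacity = rate, capacity
        self.tokens = defaultdict(lambda: capacity)       # start each key full
        self.updated = defaultdict(time.monotonic)        # last refill time per key

    def allow(self, key: str) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens[key] = min(
            self.capacity,
            self.tokens[key] + (now - self.updated[key]) * self.rate,
        )
        self.updated[key] = now
        if self.tokens[key] >= 1:
            self.tokens[key] -= 1
            return True
        return False

bucket = TokenBucket()
```

In the endpoint, this could be wired up as a FastAPI dependency that reads an X-API-Key header and raises a 429 when `bucket.allow(key)` returns False.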

Request Queuing and Priorities

For production use with multiple teams, implement a priority queue that gives premium API keys faster processing. Add request timeout handling so users do not wait indefinitely when the queue is deep. Return a job ID immediately and let clients poll for completion — this prevents HTTP timeouts on slow generations. Integrate with ComfyUI for complex multi-step workflows that need more control than the API provides.

Prompt Caching

import hashlib, json
import redis

r = redis.Redis()

def get_cached_or_generate(req: ImageRequest) -> bytes:
    # Hash the full request so different sizes/steps get separate cache entries
    cache_key = hashlib.sha256(
        json.dumps(req.model_dump(), sort_keys=True).encode()
    ).hexdigest()
    cached = r.get(f"img:{cache_key}")
    if cached:
        return cached  # Cache hit: return the stored PNG bytes
    image_bytes = generate_image_sync(req)  # Wraps the pipeline call above
    r.setex(f"img:{cache_key}", 86400, image_bytes)  # Cache for 24 hours
    return image_bytes

Caching identical prompts saves GPU cycles. Marketing teams often regenerate the same prompt while iterating on surrounding copy.

Production Hardening

Add input validation to reject prompts that could generate harmful content, and run NSFW detection on outputs before returning them. Enforce maximum resolution limits to prevent VRAM exhaustion: SDXL at 2048×2048 uses significantly more VRAM than at 1024×1024. Monitor GPU temperature and throttle requests if the server runs hot. Log every prompt and its generation parameters for auditing, and deploy the API behind authentication before exposing it beyond your network.
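The resolution cap can be enforced before a request ever reaches the GPU. The helper below is a hypothetical sketch: the 256-2048 range and the one-megapixel budget are illustrative limits, and it would be called at the top of the /generate endpoint (or attached to ImageRequest as a Pydantic validator):

```python
MAX_PIXELS = 1024 * 1024  # illustrative cap on total output pixels to bound VRAM use

def validate_dimensions(width: int, height: int) -> None:
    """Reject resolutions that would exhaust VRAM or break the SDXL latent grid."""
    for v in (width, height):
        # SDXL latents are downsampled 8x, so each dimension must divide evenly
        if v % 8 != 0:
            raise ValueError(f"{v} is not a multiple of 8")
        if not 256 <= v <= 2048:
            raise ValueError(f"{v} is outside the supported 256-2048 range")
    if width * height > MAX_PIXELS:
        raise ValueError("requested resolution exceeds the server's pixel budget")
```

Rejecting oversized requests with a 400 is far cheaper than letting the pipeline hit a CUDA out-of-memory error mid-generation, which can leave the worker in a bad state.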
