
Build AI Image Gen API with SDXL on GPU

Build a production image generation API with Stable Diffusion XL on a dedicated GPU server. Serve text-to-image and image-to-image endpoints with custom LoRAs, controlnets, and batch generation — no per-image fees or content restrictions.

What You’ll Build

In 30 minutes, you will have a production image generation API serving Stable Diffusion XL over HTTP. Your endpoints accept text prompts and return high-resolution images in under four seconds, with support for img2img transformations, custom LoRA styles, ControlNet conditioning, and batch generation. Running on a dedicated GPU server, you pay nothing per image — whether you generate 100 or 100,000 per day.

Cloud image generation APIs charge $0.02-$0.08 per image and restrict content through opaque filters. At production volumes of 5,000 images daily, that is $100-$400 per day in API fees. Self-hosted SDXL delivers identical quality with full control over model weights, LoRA adapters, and generation parameters without external content policies limiting your creative output.
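The arithmetic behind those figures is worth making explicit. A rough sketch in Python, where the flat monthly server cost is a placeholder assumption rather than a quoted price:

```python
# Sanity-checking the per-image economics quoted above. The $600/month
# server figure is a placeholder assumption, not a real price.
def daily_api_cost(images_per_day: int, price_per_image: float) -> float:
    """API spend per day at a given per-image price."""
    return images_per_day * price_per_image

def break_even_days(monthly_server_cost: float, images_per_day: int,
                    price_per_image: float) -> float:
    """Days of traffic at which cloud API fees match a flat server bill."""
    return monthly_server_cost / daily_api_cost(images_per_day, price_per_image)

low = daily_api_cost(5000, 0.02)   # $100/day at the low end
high = daily_api_cost(5000, 0.08)  # $400/day at the high end
```

At 5,000 images a day, even the low-end rate of $0.02 per image means a dedicated server priced at a few hundred dollars a month pays for itself within the first week.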

Architecture Overview

The API wraps SDXL behind a FastAPI service with GPU memory management. Requests arrive via REST endpoints for txt2img, img2img, and inpainting. A scheduler manages the diffusion pipeline, loading base models, LoRA weights, and ControlNet models into VRAM on demand. A refiner model runs optionally for enhanced detail in the final denoising steps.

The API layer supports both synchronous generation (wait for result) and async generation (submit job, poll for result). Batch endpoints accept multiple prompts and return a ZIP archive of generated images. A caching layer stores frequently used LoRA and ControlNet combinations in VRAM to avoid reload latency.
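The submit-then-poll flow can be sketched with an in-process job registry. The names and the single-worker ThreadPoolExecutor below are illustrative assumptions; a production deployment would persist jobs in Redis or a database instead:

```python
# Minimal sketch of the async submit/poll pattern described above,
# using an in-process executor. Function names are illustrative.
import uuid
from concurrent.futures import Future, ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)  # one GPU -> serialise jobs
jobs: dict[str, Future] = {}

def submit_job(generate_fn, prompt: str) -> str:
    """Queue a generation and hand back a job id for later polling."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(generate_fn, prompt)
    return job_id

def poll_job(job_id: str):
    """Return ("pending", None) until the result is ready."""
    future = jobs[job_id]
    if not future.done():
        return "pending", None
    return "done", future.result()
```

Wiring these into two FastAPI routes (POST to submit, GET to poll) keeps slow generations from tying up client connections.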

GPU Requirements

| Resolution | Recommended GPU | VRAM | Speed (30 steps) |
|---|---|---|---|
| 1024×1024 | RTX 5090 | 24 GB | ~3.5 seconds |
| 1024×1024 + refiner | RTX 6000 Pro | 40 GB | ~4.5 seconds |
| 2048×2048 / batch | RTX 6000 Pro 96 GB | 80 GB | ~8 seconds |

SDXL base requires roughly 7GB VRAM. Adding the refiner model, ControlNet, and a LoRA adapter fits within 24GB. Concurrent generation or higher resolutions benefit from larger VRAM. See our self-hosted model guide for multi-model deployment strategies.
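One way to reason about which combinations fit a given card is a simple budget check. The per-component sizes and the activation headroom below are rough assumptions, not measurements:

```python
# Back-of-envelope VRAM budget for the fp16 stack described above.
# All figures are rough assumptions, not measured values.
COMPONENT_GB = {
    "sdxl_base": 7.0,   # base UNet + text encoders + VAE, fp16
    "refiner": 6.0,     # assumed close to the base UNet in size
    "controlnet": 2.5,  # assumed per ControlNet model
    "lora": 0.2,        # adapters are small
}

def fits_in_vram(components: list[str], vram_gb: float,
                 headroom_gb: float = 4.0) -> bool:
    """Check a model combination against a card, leaving headroom for
    latents and attention activations during denoising."""
    return sum(COMPONENT_GB[c] for c in components) + headroom_gb <= vram_gb
```

Under these assumptions the full base + refiner + ControlNet + LoRA stack comes in under 24 GB, consistent with the table above; batching and 2048×2048 output inflate the activation term well past this simple headroom figure.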

Step-by-Step Build

Deploy the SDXL pipeline on your GPU server with FastAPI handling HTTP requests. Configure model loading and build the generation endpoints.

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from diffusers import StableDiffusionXLPipeline, AutoencoderKL
import torch, io

app = FastAPI()

# fp16-safe VAE avoids the black-image/NaN issue with the stock SDXL VAE in fp16
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae, torch_dtype=torch.float16
).to("cuda")
pipe.enable_xformers_memory_efficient_attention()  # requires xformers installed

@app.post("/v1/images/generations")
def generate(prompt: str, negative_prompt: str = "",
             width: int = 1024, height: int = 1024,
             steps: int = 30, cfg_scale: float = 7.5,
             seed: int = -1):
    # Sync handler: FastAPI runs it in a threadpool, so the multi-second
    # GPU call does not block the event loop for other requests.
    generator = torch.Generator("cuda")
    if seed >= 0:
        generator.manual_seed(seed)  # fixed seed -> reproducible output

    image = pipe(
        prompt=prompt, negative_prompt=negative_prompt,
        width=width, height=height,
        num_inference_steps=steps, guidance_scale=cfg_scale,
        generator=generator
    ).images[0]

    # Stream the PNG back without touching disk
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    buf.seek(0)
    return StreamingResponse(buf, media_type="image/png")
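With the server running, the endpoint can be smoke-tested from any Python client using only the standard library. The host and port are assumptions from a default uvicorn launch:

```python
# Client-side smoke test for the endpoint above, assuming the server
# listens on localhost:8000; stdlib only, no requests dependency.
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "prompt": "a lighthouse at dusk, volumetric light",
    "steps": 30, "seed": 42,
})
req = urllib.request.Request(
    f"http://localhost:8000/v1/images/generations?{params}", method="POST"
)
# Uncomment with the server running to save the generated image:
# with urllib.request.urlopen(req) as resp:
#     open("out.png", "wb").write(resp.read())  # raw PNG bytes from the API
```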

Add LoRA loading endpoints to swap styles without restarting the server. Because the endpoint mirrors OpenAI's /v1/images/generations path, existing client code can be repointed at your server with minimal changes. See production setup patterns for request queuing and concurrency management.
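The core logic of such a swap endpoint can be sketched as follows, assuming LoRA files live in a single /models/loras directory (a hypothetical layout); restricting resolution to that directory stops clients from loading arbitrary files:

```python
# Sketch of runtime LoRA swapping; LORA_DIR is an assumed layout and
# `pipe` is the diffusers pipeline built earlier.
from pathlib import Path

LORA_DIR = Path("/models/loras")  # assumed flat directory of .safetensors files

def resolve_lora(name: str, lora_dir: Path = LORA_DIR) -> Path:
    """Map a client-supplied style name to a file inside lora_dir only."""
    path = (lora_dir / f"{name}.safetensors").resolve()
    if path.parent != lora_dir.resolve():
        raise ValueError(f"unknown LoRA: {name}")  # blocks path traversal
    return path

def swap_lora(pipe, name: str) -> None:
    """Drop the current adapter, then load the requested one."""
    pipe.unload_lora_weights()
    pipe.load_lora_weights(str(resolve_lora(name)))
```

Exposing swap_lora behind a small POST route (for example /v1/loras/{name}) lets clients change styles between requests without a restart.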

Custom Models and Styles

Load custom LoRA adapters at runtime to offer branded styles, product-specific aesthetics, or domain-specific generation. Store LoRA files on disk and expose an endpoint that loads them into the pipeline: pipe.load_lora_weights("path/to/lora.safetensors"). Multiple LoRAs can be composed with weight blending for combined effects.
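Composition with weight blending can be sketched against the diffusers adapter API (load_lora_weights with adapter_name, then set_adapters); the style names and file paths here are placeholders:

```python
# Sketch of weight-blended LoRA composition via the diffusers adapter
# API; adapter names and paths are illustrative.
def apply_style_mix(pipe, lora_files: dict[str, str],
                    weights: dict[str, float]) -> None:
    """Load each LoRA under a named adapter, then activate the blend."""
    for name, path in lora_files.items():
        pipe.load_lora_weights(path, adapter_name=name)
    pipe.set_adapters(list(weights), adapter_weights=list(weights.values()))

# e.g. apply_style_mix(pipe,
#     {"brand": "loras/brand.safetensors", "grain": "loras/grain.safetensors"},
#     {"brand": 0.8, "grain": 0.4})
```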

For product photography, architecture visualisation, or marketing asset generation, fine-tune LoRAs on your own datasets using the same GPU during off-peak hours. Pair image generation with an LLM to auto-generate prompts from product descriptions, creating an end-to-end content pipeline.

Deploy Your Image Generation API

A self-hosted SDXL API eliminates per-image costs, removes content restrictions, and lets you deploy custom fine-tuned models. Serve internal creative teams or integrate into your product pipeline. Launch on GigaGPU dedicated GPU hosting and generate at scale. Browse more API use cases and tutorials in our library.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers


We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
