
FastAPI AI Inference Server: Complete Build

Complete guide to building a FastAPI AI inference server on a dedicated GPU covering request validation, streaming, rate limiting, authentication, health checks, and production deployment.

You will build a production-grade FastAPI server that wraps a GPU-hosted model with request validation, streaming, rate limiting, and health monitoring. By the end, you will have a deployable inference API on your dedicated GPU server that handles concurrent clients reliably.

Project Structure

Organise the server with clear separation between API routes, model management, and configuration.

inference-server/
  app/
    __init__.py
    main.py          # FastAPI app and middleware
    routes.py        # API endpoints
    models.py        # Pydantic schemas
    inference.py     # Model loading and inference logic
    config.py        # Settings and environment variables
  requirements.txt
  Dockerfile
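A plausible requirements.txt for this layout, inferred from the imports used in the code below (no versions pinned here; pin them for reproducible builds):

```text
fastapi
uvicorn[standard]
pydantic
pydantic-settings
openai
```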

Core Server

Set up the FastAPI app with CORS, request logging, and model lifecycle management.

# app/config.py
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    vllm_base_url: str = "http://localhost:8000/v1"
    api_key: str = "your-secret-key"
    rate_limit_rpm: int = 60
    model_name: str = "meta-llama/Llama-3.1-8B-Instruct"

settings = Settings()

# app/main.py
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from app.routes import router

app = FastAPI(title="AI Inference Server", version="1.0.0")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # tighten to known origins in production
    allow_methods=["*"],
    allow_headers=["*"],
)
app.include_router(router, prefix="/api/v1")

The server proxies to a vLLM backend, adding authentication, validation, and rate limiting that vLLM does not provide natively. For vLLM setup, see the production deployment guide.

Request and Response Schemas

Use Pydantic models for automatic request validation and OpenAPI documentation.

# app/models.py
from pydantic import BaseModel, Field

class ChatRequest(BaseModel):
    messages: list[dict] = Field(..., min_length=1)
    max_tokens: int = Field(256, ge=1, le=4096)
    temperature: float = Field(0.7, ge=0.0, le=2.0)
    top_p: float = Field(0.9, ge=0.0, le=1.0)
    stream: bool = False

class ChatResponse(BaseModel):
    id: str
    content: str
    model: str
    tokens_used: int

class HealthResponse(BaseModel):
    status: str
    model: str
    gpu_available: bool
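It is worth seeing the validation in action. A quick standalone check, restating two of the ChatRequest fields above (assuming Pydantic v2, where `min_length` applies to lists): out-of-range values are rejected before any GPU work happens.

```python
from pydantic import BaseModel, Field, ValidationError

class ChatRequest(BaseModel):
    messages: list[dict] = Field(..., min_length=1)
    max_tokens: int = Field(256, ge=1, le=4096)

# Valid request: defaults fill in the unset fields
ok = ChatRequest(messages=[{"role": "user", "content": "hi"}])

# Invalid request: empty messages and max_tokens above the cap
try:
    ChatRequest(messages=[], max_tokens=10000)
except ValidationError as exc:
    bad_fields = {err["loc"][0] for err in exc.errors()}
```

FastAPI performs exactly this validation on the request body and returns a 422 with the field-level errors, so the endpoint code never sees a malformed request.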

API Endpoints

Build the inference endpoint with both synchronous and streaming modes, plus health checks for monitoring.

# app/routes.py
from fastapi import APIRouter, HTTPException, Depends
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
from app.models import ChatRequest, ChatResponse, HealthResponse
from app.config import settings

router = APIRouter()
# Use the async client: the sync OpenAI client would block the event loop
# for the whole generation, stalling every other request
client = AsyncOpenAI(base_url=settings.vllm_base_url, api_key="not-needed")

@router.post("/chat", response_model=ChatResponse)
async def chat(req: ChatRequest):
    if req.stream:
        return await chat_stream(req)
    response = await client.chat.completions.create(
        model=settings.model_name,
        messages=req.messages,
        max_tokens=req.max_tokens,
        temperature=req.temperature,
        top_p=req.top_p
    )
    return ChatResponse(
        id=response.id,
        content=response.choices[0].message.content,
        model=response.model,
        tokens_used=response.usage.total_tokens
    )

@router.post("/chat/stream")
async def chat_stream(req: ChatRequest):
    async def generate():
        stream = await client.chat.completions.create(
            model=settings.model_name,
            messages=req.messages,
            max_tokens=req.max_tokens,
            temperature=req.temperature,
            stream=True
        )
        async for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                yield f"data: {content}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")

@router.get("/health", response_model=HealthResponse)
async def health():
    try:
        await client.models.list()  # cheap round-trip to the vLLM backend
        return HealthResponse(status="healthy", model=settings.model_name, gpu_available=True)
    except Exception:
        raise HTTPException(status_code=503, detail="Model unavailable")
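On the client side, the stream arrives as Server-Sent Events. A hypothetical stdlib-only helper for pulling tokens out of the `data:` lines the endpoint above emits (a real client would feed it lines from an httpx or requests response):

```python
def parse_sse_tokens(lines):
    """Yield token payloads from 'data: <token>' lines, stopping at [DONE]."""
    for line in lines:
        line = line.strip("\r\n")
        if not line.startswith("data: "):
            continue  # skip blank separator lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield payload
```

One caveat with raw-token payloads: a token containing a newline would break SSE framing, so JSON-encoding each chunk on the server is the more robust choice in production.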

Authentication and Rate Limiting

Add API key authentication and per-client rate limiting to protect your GPU resources.

# app/routes.py (additions)
from fastapi import Security
from fastapi.security import APIKeyHeader
from collections import defaultdict
import time

api_key_header = APIKeyHeader(name="X-API-Key")

async def verify_api_key(api_key: str = Security(api_key_header)):
    if api_key != settings.api_key:
        raise HTTPException(status_code=403, detail="Invalid API key")
    return api_key

# Simple in-memory sliding-window rate limiter. State is per process, so
# this only works with a single Uvicorn worker
request_counts: dict[str, list[float]] = defaultdict(list)

async def rate_limit(api_key: str = Depends(verify_api_key)):
    now = time.time()
    # Drop timestamps older than 60 seconds, then check the window
    request_counts[api_key] = [t for t in request_counts[api_key] if now - t < 60]
    if len(request_counts[api_key]) >= settings.rate_limit_rpm:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    request_counts[api_key].append(now)
    return api_key

# Apply to endpoints via the route decorator's dependencies argument
@router.post("/chat", response_model=ChatResponse, dependencies=[Depends(rate_limit)])
async def chat(req: ChatRequest):
    ...  # existing logic unchanged

For production rate limiting with distributed state, use Redis. For a full API gateway solution, see the Kong/Traefik guide.
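The in-memory limiter above also resets on every restart. A sketch of a Redis-backed fixed-window variant, assuming redis-py's `incr`/`expire` API (the key format and defaults here are illustrative):

```python
import time

def allow_request(redis_client, api_key: str, limit: int = 60, window_s: int = 60) -> bool:
    """Fixed-window counter: one Redis key per (api_key, window) pair."""
    key = f"rl:{api_key}:{int(time.time() // window_s)}"
    count = redis_client.incr(key)          # atomic, safe across workers
    if count == 1:
        redis_client.expire(key, window_s)  # let stale windows expire
    return count <= limit
```

In the FastAPI dependency you would raise `HTTPException(status_code=429)` whenever this returns False. Fixed windows allow short bursts at window boundaries; a sorted-set sliding window is stricter if that matters.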

Deployment

# Run with Uvicorn (one worker: the in-memory rate limiter is per-process)
uvicorn app.main:app --host 0.0.0.0 --port 8080 --workers 1

# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ app/
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]

Add Prometheus metrics for observability and ELK logging for request tracing. For the Flask alternative, see the Flask AI API guide. The self-hosting guide covers infrastructure planning, and our tutorials section has more server patterns.

Deploy AI Inference Servers on Dedicated GPUs

Run FastAPI inference servers on bare-metal GPU hardware. Full root access, no per-token fees, predictable latency.

Browse GPU Servers
