You will build a production-grade FastAPI server that wraps a GPU-hosted model with request validation, streaming, rate limiting, and health monitoring. By the end, you will have a deployable inference API on your dedicated GPU server that handles concurrent clients reliably.
Project Structure
Organise the server with clear separation between API routes, model management, and configuration.
```
inference-server/
├── app/
│   ├── __init__.py
│   ├── main.py        # FastAPI app and middleware
│   ├── routes.py      # API endpoints
│   ├── models.py      # Pydantic schemas
│   ├── inference.py   # Model loading and inference logic
│   └── config.py      # Settings and environment variables
├── requirements.txt
└── Dockerfile
```
Core Server
Set up the FastAPI app with CORS, request logging, and model lifecycle management.
```python
# app/config.py
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    vllm_base_url: str = "http://localhost:8000/v1"
    api_key: str = "your-secret-key"
    rate_limit_rpm: int = 60
    model_name: str = "meta-llama/Llama-3.1-8B-Instruct"

settings = Settings()
```
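Because `Settings` extends `BaseSettings`, every field can be overridden from the environment at deploy time (pydantic-settings matches field names to environment variables case-insensitively), so secrets never need to live in the image. The values below are placeholders:

```shell
# Override the config.py defaults without touching code
export API_KEY="change-me-in-production"
export RATE_LIMIT_RPM=120
export VLLM_BASE_URL="http://localhost:8000/v1"
```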
```python
# app/main.py
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from app.routes import router

app = FastAPI(title="AI Inference Server", version="1.0.0")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # tighten to known origins in production
    allow_methods=["*"],
    allow_headers=["*"],
)

app.include_router(router, prefix="/api/v1")
```
The server proxies to a vLLM backend, adding authentication, validation, and rate limiting that vLLM does not provide natively. For vLLM setup, see the production deployment guide.
Request and Response Schemas
Use Pydantic models for automatic request validation and OpenAPI documentation.
```python
# app/models.py
from pydantic import BaseModel, Field

class ChatRequest(BaseModel):
    messages: list[dict] = Field(..., min_length=1)
    max_tokens: int = Field(256, ge=1, le=4096)
    temperature: float = Field(0.7, ge=0.0, le=2.0)
    top_p: float = Field(0.9, ge=0.0, le=1.0)
    stream: bool = False

class ChatResponse(BaseModel):
    id: str
    content: str
    model: str
    tokens_used: int

class HealthResponse(BaseModel):
    status: str
    model: str
    gpu_available: bool
```
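To see the validation in action, here is a quick sketch (assuming pydantic v2, where `min_length` applies to lists): out-of-range or empty requests are rejected with a structured error before any work reaches the GPU.

```python
from pydantic import BaseModel, Field, ValidationError

# Mirrors the ChatRequest schema from app/models.py
class ChatRequest(BaseModel):
    messages: list[dict] = Field(..., min_length=1)
    max_tokens: int = Field(256, ge=1, le=4096)
    temperature: float = Field(0.7, ge=0.0, le=2.0)

# Valid request: defaults are filled in automatically
ok = ChatRequest(messages=[{"role": "user", "content": "hi"}])
print(ok.max_tokens)  # 256

# Invalid request: empty messages plus temperature > 2.0
try:
    ChatRequest(messages=[], temperature=5.0)
except ValidationError as e:
    print(len(e.errors()))  # 2
```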
API Endpoints
Build the inference endpoint with both synchronous and streaming modes, plus health checks for monitoring.
```python
# app/routes.py
from fastapi import APIRouter, HTTPException, Depends
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

from app.models import ChatRequest, ChatResponse, HealthResponse
from app.config import settings

router = APIRouter()

# Use the async client so calls to vLLM don't block the event loop
client = AsyncOpenAI(base_url=settings.vllm_base_url, api_key="not-needed")

@router.post("/chat", response_model=ChatResponse)
async def chat(req: ChatRequest):
    if req.stream:
        return await chat_stream(req)
    response = await client.chat.completions.create(
        model=settings.model_name,
        messages=req.messages,
        max_tokens=req.max_tokens,
        temperature=req.temperature,
        top_p=req.top_p,
    )
    return ChatResponse(
        id=response.id,
        content=response.choices[0].message.content,
        model=response.model,
        tokens_used=response.usage.total_tokens,
    )

@router.post("/chat/stream")
async def chat_stream(req: ChatRequest):
    async def generate():
        stream = await client.chat.completions.create(
            model=settings.model_name,
            messages=req.messages,
            max_tokens=req.max_tokens,
            temperature=req.temperature,
            stream=True,
        )
        async for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                yield f"data: {content}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")

@router.get("/health", response_model=HealthResponse)
async def health():
    try:
        await client.models.list()  # fails fast if the vLLM backend is down
        return HealthResponse(status="healthy", model=settings.model_name, gpu_available=True)
    except Exception:
        raise HTTPException(status_code=503, detail="Model unavailable")
```
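One subtlety in the streaming endpoint: SSE terminates each event with a blank line, so a chunk whose content itself contains a newline would be framed incorrectly by the inline f-string. A small helper (hypothetical, not part of the routes above) that splits the payload across `data:` lines keeps every event well-formed:

```python
def sse_event(content: str) -> str:
    """Format one server-sent event. Multi-line payloads get one
    'data:' line per line, followed by the blank-line terminator."""
    return "".join(f"data: {line}\n" for line in content.split("\n")) + "\n"

print(repr(sse_event("hello")))         # 'data: hello\n\n'
print(repr(sse_event("line1\nline2")))  # 'data: line1\ndata: line2\n\n'
```

Inside `generate()`, `yield sse_event(content)` would replace the inline f-string.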
Authentication and Rate Limiting
Add API key authentication and per-client rate limiting to protect your GPU resources.
```python
# app/routes.py (continued)
import secrets
import time
from collections import defaultdict

from fastapi import Security
from fastapi.security import APIKeyHeader

api_key_header = APIKeyHeader(name="X-API-Key")

async def verify_api_key(api_key: str = Security(api_key_header)):
    # compare_digest avoids leaking key contents through timing differences
    if not secrets.compare_digest(api_key, settings.api_key):
        raise HTTPException(status_code=403, detail="Invalid API key")
    return api_key

# Simple in-memory rate limiter (single-process only)
request_counts: dict[str, list[float]] = defaultdict(list)

async def rate_limit(api_key: str = Depends(verify_api_key)):
    now = time.time()
    # Keep only timestamps inside the 60-second window
    request_counts[api_key] = [t for t in request_counts[api_key] if now - t < 60]
    if len(request_counts[api_key]) >= settings.rate_limit_rpm:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    request_counts[api_key].append(now)
    return api_key

# Apply to endpoints
@router.post("/chat", dependencies=[Depends(rate_limit)])
async def chat(req: ChatRequest):
    # ... existing logic
```
For production rate limiting with distributed state, use Redis. For a full API gateway solution, see the Kong/Traefik guide.
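The window-pruning logic above can also be isolated into a small class with an injectable clock, which makes the algorithm easy to unit-test without sleeping. A sketch (not part of the files above):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Same sliding-window algorithm as the in-memory limiter above;
    a deque per key lets expired timestamps be pruned from the front."""

    def __init__(self, limit: int, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable for deterministic tests
        self.hits = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        q = self.hits[key]
        while q and now - q[0] >= self.window:  # drop expired hits
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True

# Deterministic check with a fake clock
t = [0.0]
lim = SlidingWindowLimiter(limit=2, window=60.0, clock=lambda: t[0])
print(lim.allow("k"), lim.allow("k"), lim.allow("k"))  # True True False
t[0] = 61.0
print(lim.allow("k"))  # True, the window has rolled over
```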
Deployment
```shell
# Run with Uvicorn (single worker: the in-memory rate limiter is per-process)
uvicorn app.main:app --host 0.0.0.0 --port 8080 --workers 1
```
```dockerfile
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ app/
EXPOSE 8080
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]
```
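Building and running the container is then standard Docker (the image name `inference-server` is an arbitrary choice here; `--network host` keeps the vLLM backend on `localhost:8000` reachable from inside the container):

```shell
docker build -t inference-server .
docker run -d --name inference-server --network host \
  -e API_KEY="change-me-in-production" \
  inference-server

# Smoke test against the health endpoint
curl http://localhost:8080/api/v1/health
```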
Add Prometheus metrics for observability and ELK logging for request tracing. For the Flask alternative, see the Flask AI API guide. The self-hosting guide covers infrastructure planning, and our tutorials section has more server patterns.
Deploy AI Inference Servers on Dedicated GPUs
Run FastAPI inference servers on bare-metal GPU hardware. Full root access, no per-token fees, predictable latency.
Browse GPU Servers