A single AI inference endpoint works until you need rate limiting, authentication, model routing, or load balancing across multiple GPU servers. An API gateway sits in front of your inference backends and handles all of that without modifying your model-serving code. This guide walks through configuring Kong and Traefik as gateways for self-hosted AI on dedicated GPU servers.
## Why an API Gateway for AI
Running a vLLM production endpoint or an Ollama server directly exposed to clients creates operational problems as usage grows. An API gateway centralises cross-cutting concerns that every inference API eventually needs.
| Concern | Without Gateway | With Gateway |
|---|---|---|
| Rate limiting | Custom middleware per service | Declarative policy |
| Auth | Baked into inference code | Gateway-level API keys / JWT |
| Load balancing | DNS round-robin or manual | Active health checks, weighted routing |
| Model routing | Separate URLs per model | Single entry point, header-based routing |
| TLS termination | Per-service certificate management | Centralised at the gateway |
## Kong Gateway Configuration
Kong runs as a reverse proxy with a plugin architecture. Install it alongside your inference servers and define services, routes, and plugins declaratively.
```yaml
# docker-compose.yml
version: "3.8"
services:
  kong:
    image: kong:3.6
    environment:
      KONG_DATABASE: "off"
      KONG_DECLARATIVE_CONFIG: /etc/kong/kong.yml
      KONG_PROXY_LISTEN: "0.0.0.0:8080"
      KONG_ADMIN_LISTEN: "0.0.0.0:8001"
    volumes:
      - ./kong.yml:/etc/kong/kong.yml
    ports:
      - "8080:8080"
      - "8001:8001"
    networks:
      - ai-net

  vllm:
    image: vllm/vllm-openai:latest
    command: ["--model", "meta-llama/Llama-3.1-8B-Instruct", "--port", "8000"]
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    networks:
      - ai-net

networks:
  ai-net:
```
```yaml
# kong.yml -- declarative config
_format_version: "3.0"

services:
  - name: llm-inference
    url: http://vllm:8000
    routes:
      - name: llm-route
        paths:
          - /v1
        strip_path: false
    plugins:
      - name: rate-limiting
        config:
          minute: 60
          policy: local
      - name: key-auth
        config:
          key_names: ["X-API-Key"]
      - name: cors
        config:
          origins: ["*"]
          methods: ["GET", "POST", "OPTIONS"]

  - name: embedding-service
    url: http://embedding:8000
    routes:
      - name: embedding-route
        paths:
          - /v1/embeddings
        strip_path: false

consumers:
  - username: production-app
    keyauth_credentials:
      - key: prod-api-key-here
```
This configuration places vLLM behind Kong with rate limiting at 60 requests per minute, API key authentication, and CORS headers — all without touching the inference server code.
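From the client's side, Kong's rate limiting surfaces as HTTP 429 responses, so callers should back off and retry rather than fail outright. A minimal sketch of that retry logic — the `send` callable stands in for a real HTTP request to the gateway (e.g. a POST to `/v1/chat/completions` with the `X-API-Key` header), and the delay values are illustrative assumptions:

```python
import time

def request_with_retry(send, max_retries=3, base_delay=1.0):
    """Call send() (which returns an HTTP status code), retrying on 429.

    Sleeps base_delay, then 2x, 4x, ... between attempts (exponential backoff).
    Returns the final status code.
    """
    for attempt in range(max_retries + 1):
        status = send()
        if status != 429:
            return status
        if attempt < max_retries:
            time.sleep(base_delay * (2 ** attempt))
    return status

# Simulated backend: rate-limited twice, then succeeds.
responses = iter([429, 429, 200])
status = request_with_retry(lambda: next(responses), base_delay=0.01)
print(status)  # 200
```

Honouring the gateway's `Retry-After` header, when present, is preferable to a fixed backoff schedule.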
## Traefik Configuration
Traefik uses label-based routing and automatic service discovery, making it lighter weight than Kong for simpler setups. It integrates natively with Docker and Kubernetes.
```yaml
# docker-compose.yml with Traefik
version: "3.8"
services:
  traefik:
    image: traefik:v3.0
    command:
      - "--providers.docker=true"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--certificatesresolvers.le.acme.httpchallenge=true"
      - "--certificatesresolvers.le.acme.httpchallenge.entrypoint=web"
      - "--certificatesresolvers.le.acme.email=admin@yourdomain.com"
      - "--certificatesresolvers.le.acme.storage=/letsencrypt/acme.json"
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - letsencrypt:/letsencrypt

  vllm-llama:
    image: vllm/vllm-openai:latest
    command: ["--model", "meta-llama/Llama-3.1-8B-Instruct", "--port", "8000"]
    labels:
      - "traefik.http.routers.llama.rule=Host(`api.yourdomain.com`) && PathPrefix(`/v1`)"
      - "traefik.http.routers.llama.entrypoints=websecure"
      - "traefik.http.routers.llama.tls.certresolver=le"
      - "traefik.http.services.llama.loadbalancer.server.port=8000"
      # Rate limit: 60 requests per minute sustained, bursts of up to 20
      # (without period, average is per second)
      - "traefik.http.middlewares.llm-ratelimit.ratelimit.average=60"
      - "traefik.http.middlewares.llm-ratelimit.ratelimit.period=1m"
      - "traefik.http.middlewares.llm-ratelimit.ratelimit.burst=20"
      - "traefik.http.routers.llama.middlewares=llm-ratelimit"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

volumes:
  letsencrypt:
```
Traefik automatically obtains TLS certificates via Let’s Encrypt and routes traffic to the vLLM container. Adding a second model is as simple as adding another service with different routing labels.
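Traefik's `ratelimit` middleware is a token bucket: `average` sets the sustained refill rate (per `period`, one second by default) and `burst` caps how many requests can be admitted back-to-back. A minimal sketch of those semantics — this is an illustration, not Traefik's actual implementation:

```python
class TokenBucket:
    """Token-bucket rate limiter: `average` tokens/second, `burst` capacity."""

    def __init__(self, average, burst):
        self.rate = average       # tokens refilled per second
        self.capacity = burst     # maximum bucket size
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) is admitted."""
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(average=1, burst=20)  # 1 req/s sustained, bursts of 20
# 25 simultaneous requests at t=0: the first 20 pass (the burst), 5 are rejected.
print(sum(bucket.allow(0.0) for _ in range(25)))  # 20
```

This is why `burst` matters for LLM clients: batch jobs that fire many requests at once need headroom above the sustained rate.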
## Multi-Model Routing
Route requests to different models based on headers or paths. This lets clients use a single gateway URL and select models dynamically.
```yaml
# Kong: header-based model routing
services:
  - name: llama-8b
    url: http://vllm-8b:8000
    routes:
      - name: llama-8b-route
        paths: ["/v1"]
        headers:
          X-Model-Size: ["8b"]

  - name: llama-70b
    url: http://vllm-70b:8000
    routes:
      - name: llama-70b-route
        paths: ["/v1"]
        headers:
          X-Model-Size: ["70b"]
```
Clients send `X-Model-Size: 8b` or `X-Model-Size: 70b`, and the gateway routes to the matching backend. Combine this with vLLM or Ollama backends to serve different model families from one gateway.
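On the client side, the routing header can be chosen per request — for instance, sending short prompts to the cheaper 8B backend and long ones to the 70B backend. A hypothetical helper (the length threshold is an arbitrary illustration; real routing criteria are workload-specific):

```python
def model_headers(prompt, threshold=1000):
    """Pick the X-Model-Size routing header from the prompt length.

    Prompts shorter than `threshold` characters go to the 8B backend;
    anything longer is sent to the 70B backend.
    """
    size = "8b" if len(prompt) < threshold else "70b"
    return {"X-Model-Size": size}

print(model_headers("Summarise this sentence."))   # {'X-Model-Size': '8b'}
print(model_headers("x" * 5000)["X-Model-Size"])   # 70b
```

The returned dict merges into the request headers alongside the API key, so model selection stays invisible to the rest of the client code.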
## Health Checks and Monitoring
Both gateways support active health checks to remove unhealthy backends from the rotation. Pair health checks with Prometheus and Grafana for visibility into gateway and backend performance.
```yaml
# Kong active health check config
# Note: Kong attaches health checks to upstreams (target pools), not services,
# so the service points at the upstream by host name instead of a fixed URL.
upstreams:
  - name: llm-upstream
    targets:
      - target: vllm:8000
    healthchecks:
      active:
        http_path: /health
        healthy:
          interval: 10
          successes: 2
        unhealthy:
          interval: 5
          http_failures: 3

services:
  - name: llm-inference
    host: llm-upstream
    protocol: http
    connect_timeout: 5000
    read_timeout: 120000  # LLM inference can be slow
```
Set read timeouts high enough for long-running inference requests. A 120-second timeout covers most generation workloads. Track inference logs with the ELK stack for debugging failed requests at the gateway layer.
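The `successes` and `http_failures` thresholds above mean a backend must fail three consecutive probes before it is ejected from the rotation, and pass two before it is readmitted — hysteresis that prevents flapping. A minimal sketch of that state machine (an illustration of the threshold behaviour, not Kong's implementation):

```python
class HealthTracker:
    """Track one backend's health from probe results, Kong-style.

    `successes` consecutive passes mark an unhealthy target healthy again;
    `http_failures` consecutive failures mark a healthy target unhealthy.
    """

    def __init__(self, successes=2, http_failures=3):
        self.successes_needed = successes
        self.failures_needed = http_failures
        self.healthy = True
        self.streak = 0  # consecutive probes contradicting the current state

    def record(self, probe_ok):
        """Record one probe result; return the resulting health state."""
        if probe_ok == self.healthy:
            self.streak = 0  # a confirming probe resets the opposing streak
            return self.healthy
        self.streak += 1
        needed = self.failures_needed if self.healthy else self.successes_needed
        if self.streak >= needed:
            self.healthy = not self.healthy
            self.streak = 0
        return self.healthy

t = HealthTracker()
print([t.record(ok) for ok in [False, False, False, True, True]])
# [True, True, False, False, True]
```

Two failures in a row leave the target in rotation; only the third ejects it, and two clean probes bring it back.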
## Kong vs Traefik for AI
Choose Kong when you need rich plugin functionality: advanced rate limiting per consumer, OAuth2 authentication, request transformation, or analytics. Kong’s plugin ecosystem covers most enterprise requirements out of the box.
Choose Traefik when you want minimal configuration overhead and native Docker/Kubernetes integration. Traefik excels at automatic service discovery and TLS certificate management with less operational burden.
Both gateways work well with the FastAPI inference server pattern. For async inference behind the gateway, add a Redis queue and deliver results via webhooks. Our self-hosting guide covers the full infrastructure stack, and the tutorials section has more integration patterns.
## Deploy Gated AI APIs on Dedicated GPUs
Run Kong or Traefik in front of your inference stack on bare-metal GPU servers. Rate limiting, auth, and TLS in minutes.
Browse GPU Servers