
API Gateway for AI: Kong/Traefik Setup

Complete guide to configuring Kong and Traefik as API gateways for self-hosted AI inference, covering rate limiting, authentication, load balancing, and routing multiple models behind a single endpoint.

A single AI inference endpoint works until you need rate limiting, authentication, model routing, or load balancing across multiple GPU servers. An API gateway sits in front of your inference backends and handles all of that without modifying your model-serving code. This guide walks through configuring Kong and Traefik as gateways for self-hosted AI on dedicated GPU servers.

Why an API Gateway for AI

Running a vLLM production endpoint or an Ollama server directly exposed to clients creates operational problems as usage grows. An API gateway centralises cross-cutting concerns that every inference API eventually needs.

| Concern | Without Gateway | With Gateway |
| --- | --- | --- |
| Rate limiting | Custom middleware per service | Declarative policy |
| Auth | Baked into inference code | Gateway-level API keys / JWT |
| Load balancing | DNS round-robin or manual | Active health checks, weighted routing |
| Model routing | Separate URLs per model | Single entry point, header-based routing |
| TLS termination | Per-service certificate management | Centralised at the gateway |

Kong Gateway Configuration

Kong runs as a reverse proxy with a plugin architecture. Install it alongside your inference servers and define services, routes, and plugins declaratively.

# docker-compose.yml
version: "3.8"
services:
  kong:
    image: kong:3.6
    environment:
      KONG_DATABASE: "off"
      KONG_DECLARATIVE_CONFIG: /etc/kong/kong.yml
      KONG_PROXY_LISTEN: "0.0.0.0:8080"
      KONG_ADMIN_LISTEN: "0.0.0.0:8001"
    volumes:
      - ./kong.yml:/etc/kong/kong.yml
    ports:
      - "8080:8080"
      - "8001:8001"
    networks:
      - ai-net

  vllm:
    image: vllm/vllm-openai:latest
    command: ["--model", "meta-llama/Llama-3.1-8B-Instruct", "--port", "8000"]
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    networks:
      - ai-net

networks:
  ai-net:

# kong.yml -- declarative config
_format_version: "3.0"

services:
  - name: llm-inference
    url: http://vllm:8000
    routes:
      - name: llm-route
        paths:
          - /v1
        strip_path: false
    plugins:
      - name: rate-limiting
        config:
          minute: 60
          policy: local
      - name: key-auth
        config:
          key_names: ["X-API-Key"]
      - name: cors
        config:
          origins: ["*"]
          methods: ["GET", "POST", "OPTIONS"]

  - name: embedding-service
    url: http://embedding:8000
    routes:
      - name: embedding-route
        paths:
          - /v1/embeddings
        strip_path: false

consumers:
  - username: production-app
    keyauth_credentials:
      - key: prod-api-key-here

This configuration places vLLM behind Kong with rate limiting at 60 requests per minute, API key authentication, and CORS headers — all without touching the inference server code.
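With the gateway in place, clients call Kong's proxy port instead of vLLM directly. A minimal Python sketch using only the standard library; the URL, route, and API key mirror the compose file and kong.yml above, so adjust them for your own deployment:

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:8080"   # Kong proxy port from docker-compose.yml
API_KEY = "prod-api-key-here"           # key-auth credential defined in kong.yml

def build_chat_request(prompt: str,
                       model: str = "meta-llama/Llama-3.1-8B-Instruct"):
    """Build an OpenAI-compatible chat request routed through the Kong gateway.

    Kong's key-auth plugin checks the X-API-Key header before the request
    ever reaches vLLM; the rate-limiting plugin counts it against the
    consumer's 60-per-minute budget.
    """
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{GATEWAY_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": API_KEY},
        method="POST",
    )

# To send it against a running stack:
# with urllib.request.urlopen(build_chat_request("Hello")) as resp:
#     print(resp.read().decode())
```

A request without the X-API-Key header is rejected by Kong with a 401 before it reaches the inference backend, which is the point: the vLLM container never sees unauthenticated traffic.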

Traefik Configuration

Traefik uses label-based routing and automatic service discovery, making it lighter-weight than Kong for simpler setups. It integrates natively with Docker and Kubernetes.

# docker-compose.yml with Traefik
version: "3.8"
services:
  traefik:
    image: traefik:v3.0
    command:
      - "--providers.docker=true"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--certificatesresolvers.le.acme.httpchallenge.entrypoint=web"
      - "--certificatesresolvers.le.acme.email=admin@yourdomain.com"
      - "--certificatesresolvers.le.acme.storage=/letsencrypt/acme.json"
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - letsencrypt:/letsencrypt

  vllm-llama:
    image: vllm/vllm-openai:latest
    command: ["--model", "meta-llama/Llama-3.1-8B-Instruct", "--port", "8000"]
    labels:
      - "traefik.http.routers.llama.rule=Host(`api.yourdomain.com`) && PathPrefix(`/v1`)"
      - "traefik.http.routers.llama.tls.certresolver=le"
      - "traefik.http.services.llama.loadbalancer.server.port=8000"
      - "traefik.http.middlewares.llm-ratelimit.ratelimit.average=60"
      - "traefik.http.middlewares.llm-ratelimit.ratelimit.burst=20"
      - "traefik.http.routers.llama.middlewares=llm-ratelimit"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

volumes:
  letsencrypt:

Traefik automatically obtains TLS certificates via Let’s Encrypt and routes traffic to the vLLM container. Adding a second model is as simple as adding another service with different routing labels.
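As a sketch of that second model, here is a hypothetical vllm-mistral service that could be appended to the compose file above; the model name, path prefix, and router names are illustrative, and the stripprefix middleware removes the /mistral prefix before the request reaches the backend:

```yaml
  vllm-mistral:
    image: vllm/vllm-openai:latest
    command: ["--model", "mistralai/Mistral-7B-Instruct-v0.3", "--port", "8000"]
    labels:
      - "traefik.http.routers.mistral.rule=Host(`api.yourdomain.com`) && PathPrefix(`/mistral/v1`)"
      - "traefik.http.routers.mistral.tls.certresolver=le"
      - "traefik.http.middlewares.mistral-strip.stripprefix.prefixes=/mistral"
      - "traefik.http.routers.mistral.middlewares=mistral-strip"
      - "traefik.http.services.mistral.loadbalancer.server.port=8000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
```

With this in place, /v1 requests hit the Llama backend and /mistral/v1 requests hit Mistral, both behind the same domain and certificate.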

Multi-Model Routing

Route requests to different models based on headers or paths. This lets clients use a single gateway URL and select models dynamically.

# Kong: header-based model routing
services:
  - name: llama-8b
    url: http://vllm-8b:8000
    routes:
      - name: llama-8b-route
        paths: ["/v1"]
        headers:
          X-Model-Size: ["8b"]

  - name: llama-70b
    url: http://vllm-70b:8000
    routes:
      - name: llama-70b-route
        paths: ["/v1"]
        headers:
          X-Model-Size: ["70b"]

Clients send X-Model-Size: 8b or X-Model-Size: 70b, and the gateway routes the request to the matching backend. Combine this with vLLM or Ollama backends to serve different model families from one gateway.
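On the client side, selecting a model is just a matter of setting that header. A minimal sketch, assuming the gateway listens on localhost:8080 as in the earlier compose file:

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:8080"  # single gateway entry point for all models

def routed_request(prompt: str, model_size: str):
    """Build a chat request; the X-Model-Size header selects the backend.

    Kong matches the header value against the routes defined in kong.yml,
    so "8b" reaches vllm-8b and "70b" reaches vllm-70b.
    """
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{GATEWAY_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json", "X-Model-Size": model_size},
        method="POST",
    )
```

Clients switch models by changing one string, and the backend topology stays invisible to them.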

Health Checks and Monitoring

Both gateways support active health checks to remove unhealthy backends from the rotation. Pair health checks with Prometheus and Grafana for visibility into gateway and backend performance.

# Kong active health check config -- health checks attach to an upstream,
# not to a service, so the backend is defined as an upstream target
upstreams:
  - name: llm-pool
    targets:
      - target: vllm:8000
    healthchecks:
      active:
        http_path: /health
        healthy:
          interval: 10
          successes: 2
        unhealthy:
          interval: 5
          http_failures: 3

services:
  - name: llm-inference
    url: http://llm-pool
    connect_timeout: 5000
    read_timeout: 120000  # milliseconds; LLM inference can be slow
Set read timeouts high enough for long-running inference requests. A 120-second timeout covers most generation workloads. Track inference logs with the ELK stack for debugging failed requests at the gateway layer.
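For the Prometheus side, Kong ships a prometheus plugin that can be enabled globally in the declarative config. A sketch, assuming the admin listener from the earlier compose file; once enabled, metrics are scraped from the Admin API's /metrics endpoint (port 8001 above):

```yaml
# kong.yml -- expose gateway metrics for Prometheus scraping
plugins:
  - name: prometheus
    config:
      per_consumer: true  # break out request counts per API key consumer
```

This gives you request rates, latencies, and status codes per route without instrumenting the inference servers themselves.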

Kong vs Traefik for AI

Choose Kong when you need rich plugin functionality: advanced rate limiting per consumer, OAuth2 authentication, request transformation, or analytics. Kong’s plugin ecosystem covers most enterprise requirements out of the box.

Choose Traefik when you want minimal configuration overhead and native Docker/Kubernetes integration. Traefik excels at automatic service discovery and TLS certificate management with less operational burden.

Both gateways work well with the FastAPI inference server pattern. For async inference behind the gateway, add a Redis queue and deliver results via webhooks. Our self-hosting guide covers the full infrastructure stack, and the tutorials section has more integration patterns.

Deploy Gated AI APIs on Dedicated GPUs

Run Kong or Traefik in front of your inference stack on bare-metal GPU servers. Rate limiting, auth, and TLS in minutes.

Browse GPU Servers
