
How to Secure Your AI Inference API (Authentication + Rate Limiting)

Secure your self-hosted AI inference API with API key authentication, JWT tokens, rate limiting, and IP whitelisting. Production-ready configurations for Nginx and Python middleware.

Exposing an AI inference API on the public internet without security is an invitation for abuse. A single unprotected GPU server can rack up thousands of pounds in wasted compute within hours. This tutorial shows you how to lock down your AI inference API with API key authentication, JWT tokens, per-key rate limiting, and IP-based access control. Every configuration example works on Ubuntu 22.04 or 24.04 and is tested with vLLM and Ollama backends.

API Key Authentication with Nginx

The simplest approach is to validate API keys at the Nginx layer, before requests reach your inference backend. This keeps your GPU-serving process completely shielded from unauthenticated traffic.

Create an API key file and configure Nginx to check the Authorization header:

# Generate API keys
sudo mkdir -p /etc/nginx/auth
python3 -c "import secrets; print(secrets.token_urlsafe(32))" | sudo tee -a /etc/nginx/auth/api_keys.txt
python3 -c "import secrets; print(secrets.token_urlsafe(32))" | sudo tee -a /etc/nginx/auth/api_keys.txt

# Restrict permissions so only Nginx can read the keys
sudo chmod 600 /etc/nginx/auth/api_keys.txt
sudo chown www-data:www-data /etc/nginx/auth/api_keys.txt
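If you want keys carrying a recognisable sk- prefix, matching the example keys in the Nginx map below, a small Python sketch (the prefix is purely a labelling convention for humans; Nginx does not require it):

```python
import secrets

def make_api_key(prefix: str = "sk-", nbytes: int = 32) -> str:
    """Generate a URL-safe random API key with a recognisable prefix."""
    return prefix + secrets.token_urlsafe(nbytes)

key = make_api_key()
print(key)
```

Each call draws fresh randomness from the OS, so collisions are practically impossible at this key length.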

Create an Nginx map that validates keys from the Authorization header:

# /etc/nginx/conf.d/auth_keys.conf
# Map API keys to client identifiers
map $http_authorization $api_client {
    default "";
    "Bearer sk-abc123yourfirstkeyhere"  "client_a";
    "Bearer sk-def456yoursecondkeyhere" "client_b";
}

server {
    listen 443 ssl http2;
    server_name api.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/api.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.yourdomain.com/privkey.pem;

    location /v1/ {
        # Reject requests without a valid API key
        if ($api_client = "") {
            return 401 '{"error": "Invalid or missing API key"}';
        }

        proxy_pass http://127.0.0.1:8000;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Client-ID $api_client;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_buffering off;
        proxy_read_timeout 600s;
    }

    # Health check (no auth required)
    location /health {
        proxy_pass http://127.0.0.1:8000/health;
        access_log off;
    }
}

# Test the configuration
sudo nginx -t && sudo systemctl reload nginx

# Test with valid key
curl -H "Authorization: Bearer sk-abc123yourfirstkeyhere" \
    https://api.yourdomain.com/v1/models

# Test without key (should return 401)
curl -s -o /dev/null -w "%{http_code}" https://api.yourdomain.com/v1/models

For the full Nginx reverse proxy setup including TLS and streaming, see the Nginx reverse proxy for AI inference guide.

JWT Token Authentication

For more sophisticated authentication with token expiry and per-client claims, use JWTs. Native JWT validation in Nginx requires NGINX Plus (or a third-party module), so this guide validates tokens in a Python middleware instead:

pip install pyjwt cryptography fastapi uvicorn httpx

# Generate a signing key
python3 -c "
import secrets
key = secrets.token_hex(32)
print(f'JWT_SECRET={key}')
" | tee /opt/inference/.env

Create a token generation script for your clients:

#!/usr/bin/env python3
# /opt/inference/generate_token.py
import jwt
import datetime
import sys
import os

SECRET = os.environ.get("JWT_SECRET", "your-secret-key-here")  # export from /opt/inference/.env

def generate_token(client_id, expires_days=30, rate_limit=100):
    now = datetime.datetime.now(datetime.timezone.utc)
    payload = {
        "sub": client_id,
        "iat": now,
        "exp": now + datetime.timedelta(days=expires_days),
        "rate_limit": rate_limit,  # requests per minute
    }
    token = jwt.encode(payload, SECRET, algorithm="HS256")
    print(f"Client: {client_id}")
    print(f"Token: {token}")
    print(f"Expires: {payload['exp'].isoformat()}")
    return token

if __name__ == "__main__":
    client = sys.argv[1] if len(sys.argv) > 1 else "default"
    days = int(sys.argv[2]) if len(sys.argv) > 2 else 30
    limit = int(sys.argv[3]) if len(sys.argv) > 3 else 100
    generate_token(client, days, limit)

# Generate tokens for clients
python3 /opt/inference/generate_token.py client_production
python3 /opt/inference/generate_token.py client_staging 7 50
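For intuition, an HS256 JWT is nothing more than base64url(header).base64url(payload) plus an HMAC-SHA256 signature over those two parts. A stdlib-only sketch of what jwt.encode produces (illustrative only; use PyJWT in production, which also handles validation and expiry):

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    # JWTs use unpadded base64url encoding
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def hs256_encode(payload: dict, secret: str) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}, separators=(",", ":")).encode())
    body = b64url(json.dumps(payload, separators=(",", ":")).encode())
    signing_input = f"{header}.{body}".encode()
    sig = b64url(hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"

token = hs256_encode({"sub": "client_a", "rate_limit": 100}, "test-secret")
print(token)
```

Because the payload is only encoded, not encrypted, never put secrets in claims; the signature only proves the token was issued by someone holding the signing key.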

Per-Key Rate Limiting

Rate limiting at the Nginx level prevents any single client from monopolising GPU resources. Configure per-key limits:

# Add to /etc/nginx/nginx.conf (http block)
http {
    # Rate limit by API client identity ($api_client comes from auth_keys.conf;
    # requests with an empty key skip this zone, but auth rejects them anyway)
    limit_req_zone $api_client zone=per_client:10m rate=10r/s;

    # Global rate limit as a safety net
    limit_req_zone $binary_remote_addr zone=per_ip:10m rate=30r/s;

    # Connection limits
    limit_conn_zone $api_client zone=client_conn:10m;
}

# Add to server block
location /v1/ {
    # Per-client rate limit
    limit_req zone=per_client burst=20 nodelay;

    # Per-IP fallback
    limit_req zone=per_ip burst=50 nodelay;

    # Max concurrent connections per client
    limit_conn client_conn 30;

    # Return 429 on rate limit
    limit_req_status 429;
    limit_conn_status 429;

    proxy_pass http://127.0.0.1:8000;
    proxy_buffering off;
    proxy_read_timeout 600s;
}

For more granular per-key limits, use a Redis-backed approach. This is especially important if you serve multiple clients from the same LLM hosting infrastructure. Use the cost per million tokens calculator to set pricing tiers that align with your rate limits:

sudo apt install -y redis-server
sudo systemctl enable --now redis-server

IP Whitelisting and Firewall Rules

Layer network-level security with UFW firewall rules and Nginx IP restrictions. For proper network architecture on your dedicated GPU server, also refer to the GPU server networking guide:

# Configure UFW firewall
sudo ufw default deny incoming
sudo ufw default allow outgoing

# Allow SSH
sudo ufw allow 22/tcp

# Allow HTTPS only
sudo ufw allow 443/tcp

# Allow from specific client IPs only (more restrictive)
sudo ufw allow from 203.0.113.0/24 to any port 443

# Enable firewall
sudo ufw enable
sudo ufw status verbose

Add IP restrictions in Nginx for defence in depth:

# In your server block
location /v1/ {
    # Allow specific client networks
    allow 203.0.113.0/24;
    allow 198.51.100.0/24;
    allow 10.0.0.0/8;

    # Block everything else
    deny all;

    # ... rest of proxy config
    proxy_pass http://127.0.0.1:8000;
}
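Nginx evaluates allow/deny directives top to bottom and stops at the first match. The same CIDR membership test, useful for scripting or debugging your allow-list, using Python's stdlib ipaddress module (networks copied from the block above):

```python
import ipaddress

ALLOWED_NETS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("10.0.0.0/8"),
]

def is_allowed(client_ip: str) -> bool:
    """Return True if the IP falls inside any allowed network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETS)

print(is_allowed("203.0.113.42"))  # → True
print(is_allowed("192.0.2.1"))     # → False
```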

For a complete security architecture covering network isolation, see the private AI infrastructure guide.

Python Authentication Middleware

For maximum flexibility, run a FastAPI middleware proxy between Nginx and vLLM that handles auth, rate limiting, and request logging. This middleware approach works with any backend, including Ollama and TensorFlow Serving:

#!/usr/bin/env python3
# /opt/inference/auth_proxy.py
import os
import time
import jwt
import httpx
import redis
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import StreamingResponse
from starlette.background import BackgroundTask

JWT_SECRET = os.environ.get("JWT_SECRET", "your-secret-key")
BACKEND_URL = os.environ.get("BACKEND_URL", "http://127.0.0.1:8000")
REDIS_HOST = os.environ.get("REDIS_HOST", "localhost")

app = FastAPI()
redis_client = redis.Redis(host=REDIS_HOST, port=6379, db=0)
# One shared client so backend connections are pooled across requests
backend = httpx.AsyncClient(base_url=BACKEND_URL, timeout=600.0)

def validate_token(auth_header: str):
    if not auth_header or not auth_header.startswith("Bearer "):
        raise HTTPException(status_code=401, detail="Missing or invalid Authorization header")
    token = auth_header[7:]
    try:
        return jwt.decode(token, JWT_SECRET, algorithms=["HS256"])
    except jwt.ExpiredSignatureError:
        raise HTTPException(status_code=401, detail="Token expired")
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail="Invalid token")

def check_rate_limit(client_id: str, limit: int):
    # Fixed-window counter: one Redis key per client per minute
    key = f"rate:{client_id}:{int(time.time()) // 60}"
    current = redis_client.incr(key)
    redis_client.expire(key, 120)
    if current > limit:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")

@app.api_route("/v1/{path:path}", methods=["GET", "POST"])
async def proxy(request: Request, path: str):
    claims = validate_token(request.headers.get("Authorization", ""))
    check_rate_limit(claims["sub"], claims.get("rate_limit", 60))

    body = await request.body()
    # Stream the backend response and close it only after the client has
    # consumed it, so vLLM/Ollama SSE streaming is neither buffered nor cut off
    backend_req = backend.build_request(
        request.method, f"/v1/{path}",
        content=body,
        headers={"Content-Type": request.headers.get("Content-Type", "application/json")},
    )
    resp = await backend.send(backend_req, stream=True)
    return StreamingResponse(
        resp.aiter_bytes(),
        status_code=resp.status_code,
        media_type=resp.headers.get("content-type"),
        background=BackgroundTask(resp.aclose),
    )

# Run the auth proxy
pip install fastapi uvicorn httpx pyjwt redis
cd /opt/inference && uvicorn auth_proxy:app --host 127.0.0.1 --port 9000 --workers 4

Request Logging and Auditing

Log all API requests for billing and abuse detection:

# Custom Nginx log format for the AI API (goes in the http block)
log_format ai_api '$remote_addr - $api_client [$time_local] '
                  '"$request" $status $body_bytes_sent '
                  '"$http_referer" $request_time';

access_log /var/log/nginx/ai_api.log ai_api;

# Analyse API usage per client (field 3 is $api_client)
awk '{print $3}' /var/log/nginx/ai_api.log | sort | uniq -c | sort -rn

# Find slow requests (over 10 seconds)
awk '$NF > 10.0 {print}' /var/log/nginx/ai_api.log

# Count requests per hour
awk '{print $4}' /var/log/nginx/ai_api.log | cut -d: -f1,2 | uniq -c
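For anything beyond one-liners, the same ai_api log format is easy to parse in Python. A sketch using a regex built from the format defined above, aggregating per-client request counts and total request time (the sample line is invented for illustration):

```python
import re
from collections import Counter

# Matches: $remote_addr - $api_client [$time_local] "$request" $status $bytes "$referer" $request_time
LINE_RE = re.compile(
    r'(?P<ip>\S+) - (?P<client>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d+) (?P<bytes>\d+) '
    r'"(?P<referer>[^"]*)" (?P<rt>[\d.]+)'
)

def summarise(lines):
    """Count requests and sum request time (seconds) per API client."""
    requests, seconds = Counter(), Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            requests[m["client"]] += 1
            seconds[m["client"]] += float(m["rt"])
    return requests, seconds

sample = [
    '203.0.113.7 - client_a [01/Jan/2025:12:00:00 +0000] '
    '"POST /v1/chat/completions HTTP/1.1" 200 512 "-" 1.250'
]
reqs, secs = summarise(sample)
print(reqs["client_a"], secs["client_a"])
```

Feeding the function an open file object (`summarise(open("/var/log/nginx/ai_api.log"))`) works unchanged, since it only iterates lines.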

Full Security Configuration

Here is a complete secured deployment with all layers combined:

# Full stack deployment with docker-compose
# docker-compose.yml
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --gpu-memory-utilization 0.95
      --max-model-len 4096
    environment:
      # Llama 3.1 is gated on Hugging Face; supply a token via your shell or .env
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - model-cache:/root/.cache/huggingface
    # Only accessible from auth proxy, not exposed to host
    networks:
      - internal

  auth-proxy:
    build: ./auth-proxy
    environment:
      - JWT_SECRET=${JWT_SECRET}
      - BACKEND_URL=http://vllm:8000
      - REDIS_HOST=redis
    depends_on:
      - vllm
      - redis
    networks:
      - internal
      - external
    # No host port mapping needed: nginx reaches this service by name
    # over the shared "external" network

  redis:
    image: redis:7-alpine
    networks:
      - internal

  nginx:
    image: nginx:latest
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
      - /etc/letsencrypt:/etc/letsencrypt:ro
    ports:
      - "443:443"
      - "80:80"
    depends_on:
      - auth-proxy
    networks:
      - external

volumes:
  model-cache:

networks:
  internal:
  external:

For containerised deployments, review the Docker for AI workloads guide. To scale this architecture horizontally, see the auto-scaling AI inference guide. For monitoring request patterns, follow the GPU monitoring tutorial. Use the LLM cost calculator to plan usage-based pricing for your clients. Find more security and deployment guides in the tutorials category.

Secure GPU Servers with Full Root Access

Deploy authenticated AI APIs on dedicated NVIDIA GPU servers. GigaGPU provides DDoS protection, private networking, and the control you need to implement enterprise-grade security.

Browse GPU Servers
