Exposing vLLM or Ollama to the internet directly is a bad idea. nginx sits in front, terminates TLS, enforces auth, handles rate limiting, and keeps streaming working. On our dedicated GPU hosting this is a standard pattern. Here is the config that actually works.
Base Config
upstream llm {
    server 127.0.0.1:8000;
}

server {
    listen 443 ssl http2;
    server_name api.yourdomain.com;

    ssl_certificate     /etc/letsencrypt/live/api.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.yourdomain.com/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;

    client_max_body_size 20M;

    location /v1/ {
        proxy_pass http://llm;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Streaming: never buffer or cache token-by-token responses
        proxy_buffering off;
        proxy_cache off;

        # Long generations blow past the default 60s timeouts
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;

        chunked_transfer_encoding on;
    }
}
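A quick way to confirm streaming survives the proxy is a curl request with client-side buffering disabled. The domain, key, and model name below are placeholders for your own values:

```shell
# -N disables curl's output buffering so chunks print as they arrive
curl -N https://api.yourdomain.com/v1/chat/completions \
  -H "Authorization: Bearer your-secret-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model", "stream": true,
       "messages": [{"role": "user", "content": "Hello"}]}'
```

If tokens trickle out one by one rather than arriving in a single burst at the end, `proxy_buffering off` is doing its job.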
TLS
Use Let’s Encrypt via certbot for free renewable certs:
certbot certonly --nginx -d api.yourdomain.com
A cron or systemd timer handles renewal automatically.
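You can verify renewal works without waiting for the cert to near expiry, assuming the standard certbot package layout:

```shell
# Simulate a full renewal without touching the real certificate
certbot renew --dry-run

# Confirm the packaged systemd timer is scheduled
systemctl list-timers | grep certbot
```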
Auth
Two viable patterns:
Bearer token at nginx layer – simple, no app changes:
location /v1/ {
    # "return" inside "if" is one of the few safe uses of nginx if
    if ($http_authorization != "Bearer your-secret-key") {
        return 401;
    }
    proxy_pass http://llm;
    ...
}
Let vLLM enforce it – start vLLM with --api-key and pass the Authorization header through. This is cleaner and lets vLLM report auth failures in its own logs.
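A minimal sketch of the second pattern, assuming a recent vLLM with the OpenAI-compatible server (the model name and key are placeholders):

```shell
# vLLM returns 401 for requests whose Authorization header
# doesn't carry this key
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 127.0.0.1 --port 8000 \
  --api-key your-secret-key
```

nginx forwards the client's Authorization header to the upstream by default, so the base config above needs no changes for this to work.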
Rate Limiting
# In the http{} context:
limit_req_zone $binary_remote_addr zone=llm:10m rate=30r/m;

# In the server block:
location /v1/ {
    limit_req zone=llm burst=10 nodelay;
    ...
}
30 requests/minute per IP with a 10-request burst. Adjust to your traffic shape. For multi-tenant SaaS, key by API key rather than IP – see vLLM behind nginx with auth.
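Keying by API key is a small change. A sketch, assuming clients send a Bearer token; the zone name and rate here are illustrative:

```nginx
# Each distinct Authorization header gets its own bucket;
# requests without the header share a single bucket
limit_req_zone $http_authorization zone=llm_key:10m rate=60r/m;

location /v1/ {
    limit_req zone=llm_key burst=20 nodelay;
    ...
}
```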
Production-Ready LLM API Hosting
nginx + vLLM preconfigured with TLS and auth on UK dedicated GPUs.
Browse GPU Servers