Scaling vLLM horizontally means running multiple replicas behind a load balancer. The choice of balancer and algorithm is not trivial – LLM requests are long-lived, bursty, and streaming. On dedicated GPU hosting the right pattern depends on your client mix.
Algorithms
- Round robin: simplest. Works only if all requests carry similar weight.
- Least connections: better for LLMs, because the number of in-flight requests correlates with GPU load.
- Least response time: better still, but requires active health checks.
- Consistent hashing: useful with prefix caching – routing the same client to the same backend lets it reuse cached prefixes.
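For the consistent-hashing case, nginx's `hash` directive with the `consistent` flag pins clients to replicas. A sketch, assuming client IP is a reasonable proxy for prompt-prefix reuse:

```nginx
upstream vllm_backend {
    # The same client IP always lands on the same replica,
    # so its cached prompt prefixes stay warm.
    hash $remote_addr consistent;
    server 10.0.0.10:8000;
    server 10.0.0.11:8000;
    server 10.0.0.12:8000;
}
```

If clients share a NAT, hashing on an API key header (e.g. `hash $http_authorization consistent;`) spreads load better.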
Tool Choice
| Tool | Strengths |
|---|---|
| nginx | Mature, streaming-safe, good enough for most |
| HAProxy | Best observability, advanced algorithms |
| Caddy | Simplest config, automatic TLS |
| Envoy | Heavy but best for complex routing |
Default to nginx unless you need HAProxy’s observability.
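If you do reach for HAProxy, a minimal least-connections setup looks like the sketch below. The cert path is an assumed placeholder; the health check targets vLLM's `/health` endpoint:

```haproxy
frontend vllm_front
    bind *:443 ssl crt /etc/haproxy/vllm.pem   # combined cert+key (assumed path)
    default_backend vllm_back

backend vllm_back
    balance leastconn
    option httpchk GET /health       # vLLM health endpoint
    timeout server 1h                # long generations must not be cut off
    server gpu1 10.0.0.10:8000 check
    server gpu2 10.0.0.11:8000 check
    server gpu3 10.0.0.12:8000 check
```

The `check` flag enables active health probing, which is what makes least-response-time-style behaviour practical here.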
nginx Example
```nginx
upstream vllm_backend {
    least_conn;
    server 10.0.0.10:8000 max_fails=3 fail_timeout=30s;
    server 10.0.0.11:8000 max_fails=3 fail_timeout=30s;
    server 10.0.0.12:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 443 ssl http2;
    # ssl_certificate / ssl_certificate_key directives omitted for brevity

    location /v1/ {
        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;          # stream tokens as they arrive
        proxy_read_timeout 3600s;     # long generations
        proxy_send_timeout 3600s;
    }
}
```
Streaming
Three things matter for SSE streaming LLM responses:
- `proxy_buffering off` – otherwise chunks stall until the proxy buffer fills
- No session affinity needed for stateless completions
- A generous `proxy_read_timeout` (1h+) – long responses can take minutes
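To see why unbuffered streaming matters on the client side, here is a minimal standard-library SSE parser that yields `data:` payloads as bytes trickle in. The `token` payloads are illustrative, not vLLM's actual response schema:

```python
import io

def iter_sse_data(stream, chunk_size=8):
    """Yield the payload of each SSE `data:` line as it arrives.

    Works on any file-like byte stream: an HTTP response body in
    practice, io.BytesIO here for illustration. Small chunk_size
    simulates tokens trickling through an unbuffered proxy.
    """
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        # A complete SSE field ends with a newline; keep partial lines buffered.
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            if line.startswith(b"data: "):
                yield line[len(b"data: "):].decode()

# Simulated upstream body, shaped like an OpenAI-compatible stream:
body = io.BytesIO(
    b'data: {"token": "Hel"}\n\n'
    b'data: {"token": "lo"}\n\n'
    b'data: [DONE]\n\n'
)
events = list(iter_sse_data(body))
print(events)  # ['{"token": "Hel"}', '{"token": "lo"}', '[DONE]']
```

With `proxy_buffering on`, the proxy would hold these events until its buffer fills, and the client would see nothing for the entire wait.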