
Load Balancer in Front of vLLM – Patterns That Work

Two vLLM replicas need a load balancer. Picking the right algorithm and the right tool prevents uneven load, broken streaming, and failed health checks.

Scaling vLLM horizontally means running multiple replicas behind a load balancer. The choice of balancer and algorithm is not trivial – LLM requests are long-lived, bursty, and often streamed. On dedicated GPU hosting, the right pattern depends on your client mix.


Algorithms

  • Round robin: simplest; works when all requests carry similar weight.
  • Least connections: better for LLMs, because the number of active requests correlates with GPU load.
  • Least response time: better still, but requires active health checks.
  • Consistent hashing: useful with prefix caching – routing the same client to the same backend lets it benefit from cached prefixes.
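The consistent-hashing pattern can be sketched in nginx with the hash directive. This is a minimal example, not a drop-in config – the X-User-Id header is an assumption; hash on whatever identifies a client (an API key, a session ID) in your setup.

```nginx
upstream vllm_cached {
    # "consistent" uses ketama hashing, so adding or removing a replica
    # only remaps a small fraction of clients (and their cached prefixes).
    hash $http_x_user_id consistent;   # X-User-Id header is an assumed client key
    server 10.0.0.10:8000;
    server 10.0.0.11:8000;
}
```

Clients without the header all hash to the same bucket, so only use this when every client reliably sends the key.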

Tool Choice

Tool      Strengths
nginx     Mature, streaming-safe, good enough for most
HAProxy   Best observability, advanced algorithms
Caddy     Simplest config, automatic TLS
Envoy     Heavy but best for complex routing

Default to nginx unless you need HAProxy’s observability.

nginx Example

upstream vllm_backend {
    least_conn;                       # route to the replica with fewest active requests
    server 10.0.0.10:8000 max_fails=3 fail_timeout=30s;
    server 10.0.0.11:8000 max_fails=3 fail_timeout=30s;
    server 10.0.0.12:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 443 ssl http2;
    ssl_certificate     /etc/nginx/tls/fullchain.pem;   # adjust to your cert paths
    ssl_certificate_key /etc/nginx/tls/privkey.pem;

    location /v1/ {
        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";   # keep upstream connections alive
        proxy_buffering off;              # required for token streaming
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
    }
}
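If you do reach for HAProxy instead, the equivalent setup might look like the sketch below. The certificate path and server names are assumptions; /health is the health endpoint exposed by vLLM's OpenAI-compatible server.

```haproxy
frontend vllm_front
    bind *:443 ssl crt /etc/haproxy/certs/site.pem   # assumed cert bundle path
    default_backend vllm_backend

backend vllm_backend
    balance leastconn                 # same policy as least_conn in nginx
    option httpchk GET /health        # active health check against vLLM
    timeout server 1h                 # long-running generations
    server vllm1 10.0.0.10:8000 check
    server vllm2 10.0.0.11:8000 check
    server vllm3 10.0.0.12:8000 check
```

The active httpchk is the practical difference from the nginx config above: open-source nginx only marks backends down passively via max_fails, while HAProxy probes them continuously.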

Streaming

Three things matter for SSE streaming LLM responses:

  • proxy_buffering off – otherwise chunks stall until nginx's buffer fills
  • No session affinity needed for stateless completions
  • A generous proxy_read_timeout (1h+) – long responses can take minutes
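To sanity-check that streaming survives the proxy, it helps to parse the SSE lines on the client side. A minimal sketch, assuming the OpenAI-style chat-completion chunk format that vLLM emits (data: lines carrying a JSON chunk, terminated by data: [DONE]):

```python
import json

def parse_sse_tokens(lines):
    """Collect token text from OpenAI-style SSE lines as emitted by vLLM."""
    tokens = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and ": comment" lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            tokens.append(delta["content"])
    return tokens

stream = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]
print(parse_sse_tokens(stream))  # → ['Hel', 'lo']
```

If tokens arrive in one burst at the end rather than incrementally, buffering is still on somewhere in the chain.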

Load-Balanced vLLM Hosting

Pre-built nginx + multi-replica vLLM on UK dedicated GPU servers.

Browse GPU Servers

See also: nginx config for the OpenAI API, and data vs tensor parallelism.

