A single GPU server has a hard capacity ceiling. The next problem is splitting traffic across multiple servers without losing prefix-cache hits or breaking latency SLAs.
For 2-3 GPU servers behind a load balancer, use LiteLLM with latency-based routing. For 4+ servers with prefix caching enabled, use session-affinity routing keyed on user_id. At 10+ servers, consider Ray Serve or Triton.
When to scale out
- Single-server queue depth consistently > 100
- p99 time-to-first-token (TTFT) consistently > 1 s (see the rolling-window check after this list)
- Your model no longer fits on the largest available card
- You need redundancy to meet an SLA
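"Consistently" is worth pinning down: evaluate the thresholds over a rolling window, not on single samples. Here's a minimal sketch, assuming you've already collected window samples (e.g. the last 15 minutes of metric scrapes) as plain lists; how you scrape them from your metrics stack is up to you. The thresholds mirror the list above.

```python
from statistics import median, quantiles

def should_scale_out(queue_depths: list[int], ttft_seconds: list[float]) -> bool:
    """Return True if either scale-out threshold is breached over the window."""
    # "Consistently > 100" = the median over the window exceeds 100,
    # so a single burst doesn't trigger a scale-out.
    sustained_queue = median(queue_depths) > 100

    # p99 TTFT over the window; quantiles(n=100) yields 99 cut points,
    # and index 98 is the 99th percentile.
    slow_ttft = quantiles(ttft_seconds, n=100)[98] > 1.0

    return sustained_queue or slow_ttft
```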
Load balancing patterns
- Round-robin: simplest, but loses prefix-cache hits across servers.
- Latency-based: route to the fastest-responding server. LiteLLM does this.
- Least connections: route to the server with the fewest active sequences.
- Session affinity: hash user_id → server. Preserves the prefix cache per user (see the sketch after this list).
- KV-cache aware: route based on which server already has the prompt prefix cached. Optimal but complex.
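Session affinity is a few lines of code. A minimal sketch using rendezvous (highest-random-weight) hashing rather than plain `hash(user_id) % N`; server URLs are placeholders.

```python
import hashlib

SERVERS = [
    "http://gpu-1:8000",  # placeholder vLLM endpoints
    "http://gpu-2:8000",
    "http://gpu-3:8000",
]

def server_for_user(user_id: str, servers: list[str] = SERVERS) -> str:
    """Rendezvous hashing: score every (user, server) pair, pick the max."""
    def score(server: str) -> int:
        digest = hashlib.sha256(f"{user_id}|{server}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(servers, key=score)
```

The design choice matters for cache survival: with modulo hashing, removing one server reshuffles nearly every user and cold-starts every prefix cache at once; rendezvous hashing only remaps the users pinned to the dead server, roughly 1/N of traffic.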
KV-cache-aware routing
vLLM exposes /v1/cache_status showing which prefixes are cached. A custom router can hash the prompt prefix and route to the server that already has it.
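A minimal router sketch along those lines. The response shape used here (a JSON object with a list of cached prefix hashes) is hypothetical, as is the 1024-character prefix length; adapt both to what your servers actually return.

```python
import hashlib
import requests

SERVERS = ["http://gpu-1:8000", "http://gpu-2:8000"]  # placeholders
PREFIX_CHARS = 1024  # hash only the prompt head, where shared system prompts live

def prefix_hash(prompt: str) -> str:
    return hashlib.sha256(prompt[:PREFIX_CHARS].encode()).hexdigest()

def route(prompt: str) -> str:
    """Prefer a server that already has the prefix cached; otherwise fall
    back to a deterministic choice so repeat prompts warm one server."""
    h = prefix_hash(prompt)
    for server in SERVERS:
        try:
            # Hypothetical response shape: {"cached_prefixes": ["<sha256>", ...]}
            status = requests.get(f"{server}/v1/cache_status", timeout=0.05).json()
            if h in status.get("cached_prefixes", []):
                return server
        except requests.RequestException:
            continue  # treat an unreachable server as a cache miss
    # Miss everywhere: pick deterministically by prefix hash so the next
    # request with this prefix lands on the same, now-warm server.
    return SERVERS[int(h, 16) % len(SERVERS)]
```

Polling every server per request adds latency; a production router would cache the status responses or track its own routing decisions instead of asking the servers each time.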
Net effect: 30-50% better cache hit rate vs round-robin. Worth the complexity at 4+ server scale.
Verdict
Start with LiteLLM latency-based routing. Move to session affinity at ~5 servers. Move to KV-cache-aware only when justified by metrics.
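A minimal sketch of that starting point: a LiteLLM Router fronting two OpenAI-compatible vLLM servers under one logical model name, with latency-based routing. URLs and model names are placeholders; check the LiteLLM docs for the exact options in your version.

```python
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "llama-70b",  # one logical name, many backends
            "litellm_params": {
                "model": "openai/llama-70b",         # OpenAI-compatible vLLM server
                "api_base": "http://gpu-1:8000/v1",  # placeholder
                "api_key": "none",
            },
        },
        {
            "model_name": "llama-70b",
            "litellm_params": {
                "model": "openai/llama-70b",
                "api_base": "http://gpu-2:8000/v1",  # placeholder
                "api_key": "none",
            },
        },
    ],
    routing_strategy="latency-based-routing",  # route to the fastest recent responder
)

response = router.completion(
    model="llama-70b",
    messages=[{"role": "user", "content": "Hello"}],
)
```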
Bottom line
Multi-server LLM serving is mostly an exercise in preserving prefix-cache hit rate. See the monitoring guide.