Rolling upgrades replace inference replicas one at a time. Unlike blue-green, you do not need double capacity – you just need enough extra headroom to be missing one replica temporarily. On our dedicated GPU hosting it is the cheapest pattern for multi-GPU chassis.
When
Rolling suits:
- Multiple replicas on one server (3-8 GPUs)
- Stateless inference (which most LLM serving is)
- Model size that fits a single GPU (data parallel replicas)
It does not suit tensor-parallel deployments where one model spans all GPUs – you cannot take a single GPU out of a TP-N group without stopping the whole model.
Sequence
- Load balancer has 4 replicas configured
- Mark replica 1 draining (LB stops sending new requests)
- Wait for in-flight requests to finish (30-120 seconds)
- Stop replica 1, start new version on same GPU
- Wait for health check
- Mark replica 1 active again
- Repeat for replicas 2, 3, 4
During each step you are serving at 75% capacity (3 of 4 replicas). Plan the rollout for low-traffic periods or ensure baseline traffic fits in N-1 replicas.
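The sequence above can be sketched as a small driver loop. Everything your load balancer and process manager actually expose is passed in as callbacks here – `drain`, `wait_drained`, `restart`, `healthy` and `activate` are hypothetical hooks, not a real API:

```python
import time

def rolling_upgrade(replicas, drain, wait_drained, restart, healthy, activate,
                    health_timeout=300, poll=5):
    """Upgrade replicas one at a time; the others keep serving."""
    for r in replicas:
        drain(r)          # LB stops sending new requests to this replica
        wait_drained(r)   # block until in-flight requests finish
        restart(r)        # stop the old process, start the new version
        deadline = time.time() + health_timeout
        while not healthy(r):      # gate re-adding on the health check
            if time.time() > deadline:
                raise RuntimeError(f"{r} failed health check after upgrade")
            time.sleep(poll)
        activate(r)       # LB routes to it again; move to the next replica
```

Because each replica is only re-activated after it passes its health check, a broken new version stops the rollout at N-1 healthy replicas instead of taking the whole service down.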
Graceful Drain
nginx-level:
```nginx
# Mark replica 1 as draining - no new connections
upstream llm {
    server replica1:8000 down;
    server replica2:8000;
    ...
}
```

Then reload the configuration:

```shell
nginx -s reload
```
In-flight requests continue to completion. New requests route elsewhere. After drain, stop the process.
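One way to tell that a replica has actually drained is to poll its metrics endpoint until the in-flight gauge reaches zero. A minimal sketch, assuming a Prometheus-style `/metrics` endpoint with an in-flight requests gauge (vLLM publishes one named `vllm:num_requests_running`; treat the exact metric name as an assumption for your server):

```python
import time
import urllib.request

METRIC = "vllm:num_requests_running"  # in-flight gauge name (assumption)

def running_requests(metrics_text, metric=METRIC):
    """Return the gauge value from a Prometheus text dump, or None if absent."""
    for line in metrics_text.splitlines():
        if line.startswith(metric + " ") or line.startswith(metric + "{"):
            return float(line.rsplit(" ", 1)[1])
    return None

def wait_drained(metrics_url, poll=5, timeout=120):
    """Poll the replica's /metrics until no requests are in flight."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        with urllib.request.urlopen(metrics_url) as resp:
            text = resp.read().decode()
        if running_requests(text) == 0:   # None (metric missing) keeps polling
            return True
        time.sleep(poll)
    return False
```

If the timeout expires you have to decide between waiting longer and cutting off stragglers – the 30-120 second window above covers typical LLM request latencies.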
Caveats
During rollout you have mixed versions serving. For user-facing chat, this can mean slightly different behaviour between consecutive requests. If that matters (strictly consistent responses required), use blue-green instead.
For API customers who can tolerate occasional request retries or slight behaviour changes, rolling upgrades are fine.
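If you want to know when the mixed-version window has closed, compare what each replica reports as its served model. A sketch assuming each replica answers an OpenAI-style `GET /v1/models` request (vLLM's OpenAI-compatible server does; fetching the bodies from each replica URL is left out here):

```python
import json

def served_model_ids(models_json):
    """Model ids from an OpenAI-style GET /v1/models response body."""
    return sorted(m["id"] for m in json.loads(models_json)["data"])

def rollout_complete(response_bodies):
    """True once every replica reports the same model id(s)."""
    ids = [served_model_ids(b) for b in response_bodies]
    return all(i == ids[0] for i in ids)
```

Gating any post-rollout announcement (or config flip) on this check avoids telling users a new model is live while an old replica is still serving.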
Rolling Upgrade Friendly Hosting
Multi-GPU UK dedicated chassis ready for rolling model deployments.
Browse GPU Servers. See also: zero-downtime swap and graceful shutdown in vLLM.