Tutorials

Rolling Model Upgrade on an Inference Server

Replace replicas one at a time with the new model version. Cheaper than blue-green when you have multiple GPUs in one chassis.

Rolling upgrades replace inference replicas one at a time. Unlike blue-green, you do not need double capacity – you just need enough extra headroom to be missing one replica temporarily. On our dedicated GPU hosting it is the cheapest pattern for multi-GPU chassis.


When

Rolling suits:

  • Multiple replicas on one server (3-8 GPUs)
  • Stateless inference (which most LLM serving is)
  • Model size that fits a single GPU (data parallel replicas)

It does not suit tensor-parallel deployments where one model spans all GPUs – you cannot easily take one replica out of a TP-N group.

Sequence

  1. Load balancer has 4 replicas configured
  2. Mark replica 1 draining (LB stops sending new requests)
  3. Wait for in-flight requests to finish (30-120 seconds)
  4. Stop replica 1, start new version on same GPU
  5. Wait for health check
  6. Mark replica 1 active again
  7. Repeat for replicas 2, 3, 4

During each step you are serving at 75% capacity (3 of 4 replicas). Plan the rollout for low-traffic periods or ensure baseline traffic fits in N-1 replicas.
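The sequence above can be sketched as a small orchestration loop. This is a minimal sketch, not a drop-in script: `drain`, `wait_inflight`, `restart`, `health_ok`, and `activate` are hypothetical callables you would back with your load balancer's API and your process manager.

```python
import time

def rolling_upgrade(replicas, drain, wait_inflight, restart,
                    health_ok, activate, health_timeout=300, poll=5):
    """Upgrade replicas one at a time: drain, wait, restart, health-check, reactivate."""
    for replica in replicas:
        drain(replica)            # LB stops sending new requests to this replica
        wait_inflight(replica)    # block until in-flight requests finish
        restart(replica)          # stop old version, start new version on the same GPU
        deadline = time.time() + health_timeout
        while not health_ok(replica):          # poll the health endpoint
            if time.time() > deadline:
                raise RuntimeError(f"{replica} failed health check after upgrade")
            time.sleep(poll)
        activate(replica)         # mark active again before moving to the next one
```

Because the loop raises on a failed health check, a bad new version stops the rollout after one replica instead of taking down all of them.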

Graceful Drain

nginx-level: mark the replica down in the upstream block, then reload.

# Mark server as draining - no new connections
upstream llm {
    server replica1:8000 down;
    server replica2:8000;
    ...
}

A reload applies the change gracefully – old worker processes finish their in-flight requests before exiting:

nginx -s reload

In-flight requests continue to completion. New requests route elsewhere. After drain, stop the process.
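Rather than a fixed 30-120 second sleep, the drain wait can poll until the replica is actually idle. A minimal sketch, assuming you can query the replica's in-flight request count (e.g. from your inference server's metrics endpoint; `count_inflight` here is a hypothetical callable):

```python
import time

def wait_for_drain(count_inflight, timeout=120, interval=5):
    """Poll until the replica has zero in-flight requests, or give up at timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if count_inflight() == 0:
            return True           # fully drained, safe to stop the process
        time.sleep(interval)
    return False                  # still busy at timeout: decide whether to force-stop
```

Returning a boolean instead of raising lets the caller choose between force-stopping a stuck replica and aborting the rollout.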

Caveats

During rollout you have mixed versions serving. For user-facing chat, this can mean slightly different behaviour between consecutive requests. If that matters (strictly consistent responses required), use blue-green instead.

For API workloads that already tolerate occasional retries or slight behaviour changes between requests, a rolling upgrade is fine.

Rolling Upgrade Friendly Hosting

Multi-GPU UK dedicated chassis ready for rolling model deployments.

Browse GPU Servers

See zero-downtime model swap and graceful vLLM shutdown.


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
