Tutorials

Rolling Model Upgrade on an Inference Server

Replace replicas one at a time with the new model version. Cheaper than blue-green when you have multiple GPUs in one chassis.

Rolling upgrades replace inference replicas one at a time. Unlike blue-green, you do not need double capacity – you just need enough extra headroom to be missing one replica temporarily. On our dedicated GPU hosting it is the cheapest pattern for multi-GPU chassis.


When

Rolling suits:

  • Multiple replicas on one server (3-8 GPUs)
  • Stateless inference (which most LLM serving is)
  • Model size that fits a single GPU (data parallel replicas)

It does not suit tensor-parallel deployments where one model spans all GPUs – you cannot easily take one replica out of a TP-N group.

Sequence

  1. Load balancer has 4 replicas configured
  2. Mark replica 1 draining (LB stops sending new requests)
  3. Wait for in-flight requests to finish (30-120 seconds)
  4. Stop replica 1, start new version on same GPU
  5. Wait for health check
  6. Mark replica 1 active again
  7. Repeat for replicas 2, 3, 4

During each step you are serving at 75% capacity (3 of 4 replicas). Plan the rollout for low-traffic periods or ensure baseline traffic fits in N-1 replicas.
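The sequence above can be sketched as a small orchestration loop. This is a minimal sketch, not a drop-in script: `drain`, `wait_inflight`, `restart`, `health_ok`, and `activate` are hypothetical callables you would back with your load balancer's API and your process manager.

```python
import time

def rolling_upgrade(replicas, drain, wait_inflight, restart,
                    health_ok, activate, health_timeout=300, poll=5):
    """Upgrade replicas one at a time: drain, wait, restart, health-check, reactivate."""
    for replica in replicas:
        drain(replica)            # LB stops sending new requests to this replica
        wait_inflight(replica)    # block until in-flight requests finish
        restart(replica)          # stop old version, start new version on the same GPU
        deadline = time.time() + health_timeout
        while not health_ok(replica):          # poll the health endpoint
            if time.time() > deadline:
                raise RuntimeError(f"{replica} failed health check after upgrade")
            time.sleep(poll)
        activate(replica)         # mark active again before moving to the next one
```

Because the loop raises on a failed health check, a bad new version stops the rollout after one replica instead of taking down all of them.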

Graceful Drain

nginx-level: mark the replica down in the upstream block, then reload.

# Mark server as draining - no new connections
upstream llm {
    server replica1:8000 down;
    server replica2:8000;
    ...
}

A reload applies the change gracefully – old worker processes finish their in-flight requests before exiting:

nginx -s reload

In-flight requests continue to completion. New requests route elsewhere. After drain, stop the process.
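Rather than a fixed 30-120 second sleep, the drain wait can poll until the replica is actually idle. A minimal sketch, assuming you can query the replica's in-flight request count (e.g. from your inference server's metrics endpoint; `count_inflight` here is a hypothetical callable):

```python
import time

def wait_for_drain(count_inflight, timeout=120, interval=5):
    """Poll until the replica has zero in-flight requests, or give up at timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if count_inflight() == 0:
            return True           # fully drained, safe to stop the process
        time.sleep(interval)
    return False                  # still busy at timeout: decide whether to force-stop
```

Returning a boolean instead of raising lets the caller choose between force-stopping a stuck replica and aborting the rollout.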

Caveats

During rollout you have mixed versions serving. For user-facing chat, this can mean slightly different behaviour between consecutive requests. If that matters (strictly consistent responses required), use blue-green instead.

For API workloads that already tolerate occasional retries or slight behaviour changes between requests, a rolling upgrade is fine.

Rolling Upgrade Friendly Hosting

Multi-GPU UK dedicated chassis ready for rolling model deployments.

Browse GPU Servers

See zero-downtime model swap and graceful vLLM shutdown.


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
