
Zero-Downtime Model Swap in Production

Upgrading your LLM version should not take your API offline. Here is the pattern for swapping models with zero downtime on a dedicated GPU.

Upgrading the model behind your production API (Llama 3.1 to 3.3, Qwen 2.5 14B to 32B, or rolling out a custom fine-tune) should not drop customer requests. On our dedicated GPU hosting, the standard pattern uses two concurrent vLLM replicas behind a load balancer.


Pattern

  1. Load balancer fronting vLLM replica A on port 8001
  2. Start replica B on port 8002 loading the new model
  3. Wait for B’s health check to pass
  4. Update load balancer to route to B
  5. Drain A (wait for in-flight requests to complete)
  6. Stop A

Clients see zero dropped requests. In-flight requests on A finish on A; new requests go to B.
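The load-balancer side can be as small as one upstream block. A minimal sketch, assuming nginx with the conf path used by the swap script; the upstream name `llm_backend` and the timeout values are illustrative, not prescribed:

```nginx
# /etc/nginx/conf.d/llm.conf
upstream llm_backend {
    server 127.0.0.1:8001;   # replica A; the swap script rewrites this to 8002
}

server {
    listen 80;
    location / {
        proxy_pass http://llm_backend;
        # Token streaming: disable buffering so clients see output immediately
        proxy_buffering off;
        # LLM completions can take minutes; don't cut long generations off
        proxy_read_timeout 300s;
    }
}
```

Because `nginx -s reload` starts new workers and lets old ones finish their connections, the cutover itself drops nothing.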

GPU Budget

For a few minutes you are running both replicas. If both fit on the same GPU (a smaller model) you can do this on one card. If not, you briefly need a second GPU. On a multi-GPU server this is trivial: dedicate one card to A and one to B. On a single-GPU server, either accept a brief outage or do an in-place swap with short downtime (typically 10-30s).
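A quick back-of-envelope check helps decide which case you are in. This is a sketch with illustrative numbers (the `fits_two_replicas` helper and the per-replica breakdown into weights, KV cache, and CUDA overhead are assumptions, not measured values; check `nvidia-smi` on your own card):

```shell
# Rough VRAM budget: can two replicas share one card during the swap?
# Each replica needs roughly weights + KV cache + CUDA context overhead.
fits_two_replicas() {
  local gpu_gb=$1 weights_gb=$2 kv_gb=$3 overhead_gb=$4
  local per_replica=$(( weights_gb + kv_gb + overhead_gb ))
  if (( 2 * per_replica <= gpu_gb )); then
    echo "both replicas fit on one ${gpu_gb}GB card"
  else
    echo "need a second GPU (or accept brief downtime)"
  fi
}

# 7B model in FP16 (~14GB weights) on a 24GB card: does not fit twice
fits_two_replicas 24 14 4 1
# 7B model quantized to 4-bit (~4GB weights) on the same card: fits twice
fits_two_replicas 24 4 4 1
```

If the numbers are close, remember vLLM pre-allocates KV cache; cap it with `--gpu-memory-utilization` so the two replicas do not fight over the card.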

Script

#!/bin/bash
set -e

# PID of replica A, the vLLM instance currently serving on port 8001
OLD_PID=$(pgrep -f "vllm serve.*--port 8001")

# Start replica B on the second GPU with the new model
CUDA_VISIBLE_DEVICES=1 vllm serve new-model --port 8002 &
NEW_PID=$!

# Wait for B to pass its health check
until curl -sf http://localhost:8002/health; do sleep 2; done

# Switch the nginx upstream from A to B and reload gracefully
sed -i 's|server 127.0.0.1:8001|server 127.0.0.1:8002|' /etc/nginx/conf.d/llm.conf
nginx -s reload

# Drain A: give in-flight requests time to finish
sleep 60

# Stop A (wait fails if A is not a child of this shell, hence || true)
kill $OLD_PID
wait $OLD_PID 2>/dev/null || true
echo "Swap complete. New PID: $NEW_PID"

Rollback

If B shows issues after cutover, flip nginx back to A, which is still running until the drain step completes. For a comfortable rollback window, extend the drain beyond the script's 60 seconds and leave A running for 5-10 minutes post-cutover.
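The rollback is the cutover in reverse. A minimal sketch, assuming the same conf path as the swap script; the `rollback_to_a` function name is illustrative:

```shell
# Point the nginx upstream back at replica A and reload.
# Assumes replica A is still running on port 8001.
rollback_to_a() {
  local conf=$1
  sed -i 's|server 127.0.0.1:8002|server 127.0.0.1:8001|' "$conf"
  # Reload nginx if present (guarded so the function can be dry-run elsewhere)
  { command -v nginx >/dev/null 2>&1 && nginx -s reload; } || true
}

# Usage: rollback_to_a /etc/nginx/conf.d/llm.conf
```

After flipping back, confirm A is actually answering (`curl -sf http://localhost:8001/health`) before stopping B and investigating.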


See also: blue-green deployment and canary rollout.


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
