
Zero-Downtime Model Swap in Production

Upgrading your LLM version should not take your API offline. Here is the pattern for swapping models with zero downtime on a dedicated GPU.

Upgrading the model behind your production API (Llama 3.1 to 3.3, Qwen 2.5 14B to 32B, or rolling out a custom fine-tune) should not drop customer requests. On our dedicated GPU hosting, the standard pattern uses two concurrent vLLM replicas behind a load balancer.


Pattern

  1. Load balancer fronting vLLM replica A on port 8001
  2. Start replica B on port 8002 loading the new model
  3. Wait for B’s health check to pass
  4. Update load balancer to route to B
  5. Drain A (wait for in-flight requests to complete)
  6. Stop A

Clients see zero dropped requests. In-flight requests on A finish on A; new requests go to B.
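The load-balancer side can be as small as one upstream block. A minimal sketch, assuming nginx with the conf path used by the swap script; the upstream name `llm_backend` and the timeout values are illustrative, not prescribed:

```nginx
# /etc/nginx/conf.d/llm.conf
upstream llm_backend {
    server 127.0.0.1:8001;   # replica A; the swap script rewrites this to 8002
}

server {
    listen 80;
    location / {
        proxy_pass http://llm_backend;
        # Token streaming: disable buffering so clients see output immediately
        proxy_buffering off;
        # LLM completions can take minutes; don't cut long generations off
        proxy_read_timeout 300s;
    }
}
```

Because `nginx -s reload` starts new workers and lets old ones finish their connections, the cutover itself drops nothing.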

GPU Budget

For a few minutes you are running both replicas. If both fit on the same GPU (a smaller model) you can do this on one card. If not, you briefly need a second GPU. On a multi-GPU server this is trivial: dedicate one card to A and one to B. On a single-GPU server, either accept a brief outage or do an in-place swap with short downtime (typically 10-30s).
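A quick back-of-envelope check helps decide which case you are in. This is a sketch with illustrative numbers (the `fits_two_replicas` helper and the per-replica breakdown into weights, KV cache, and CUDA overhead are assumptions, not measured values; check `nvidia-smi` on your own card):

```shell
# Rough VRAM budget: can two replicas share one card during the swap?
# Each replica needs roughly weights + KV cache + CUDA context overhead.
fits_two_replicas() {
  local gpu_gb=$1 weights_gb=$2 kv_gb=$3 overhead_gb=$4
  local per_replica=$(( weights_gb + kv_gb + overhead_gb ))
  if (( 2 * per_replica <= gpu_gb )); then
    echo "both replicas fit on one ${gpu_gb}GB card"
  else
    echo "need a second GPU (or accept brief downtime)"
  fi
}

# 7B model in FP16 (~14GB weights) on a 24GB card: does not fit twice
fits_two_replicas 24 14 4 1
# 7B model quantized to 4-bit (~4GB weights) on the same card: fits twice
fits_two_replicas 24 4 4 1
```

If the numbers are close, remember vLLM pre-allocates KV cache; cap it with `--gpu-memory-utilization` so the two replicas do not fight over the card.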

Script

#!/bin/bash
set -e

# PID of replica A, the vLLM instance currently serving on port 8001
OLD_PID=$(pgrep -f "vllm serve.*--port 8001")

# Start replica B on the second GPU with the new model
CUDA_VISIBLE_DEVICES=1 vllm serve new-model --port 8002 &
NEW_PID=$!

# Wait for B to pass its health check
until curl -sf http://localhost:8002/health; do sleep 2; done

# Switch the nginx upstream from A to B and reload gracefully
sed -i 's|server 127.0.0.1:8001|server 127.0.0.1:8002|' /etc/nginx/conf.d/llm.conf
nginx -s reload

# Drain A: give in-flight requests time to finish
sleep 60

# Stop A (wait fails if A is not a child of this shell, hence || true)
kill $OLD_PID
wait $OLD_PID 2>/dev/null || true
echo "Swap complete. New PID: $NEW_PID"

Rollback

If B shows issues after cutover, flip nginx back to A, which is still running until the drain step completes. For a comfortable rollback window, extend the drain beyond the script's 60 seconds and leave A running for 5-10 minutes post-cutover.
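The rollback is the cutover in reverse. A minimal sketch, assuming the same conf path as the swap script; the `rollback_to_a` function name is illustrative:

```shell
# Point the nginx upstream back at replica A and reload.
# Assumes replica A is still running on port 8001.
rollback_to_a() {
  local conf=$1
  sed -i 's|server 127.0.0.1:8002|server 127.0.0.1:8001|' "$conf"
  # Reload nginx if present (guarded so the function can be dry-run elsewhere)
  { command -v nginx >/dev/null 2>&1 && nginx -s reload; } || true
}

# Usage: rollback_to_a /etc/nginx/conf.d/llm.conf
```

After flipping back, confirm A is actually answering (`curl -sf http://localhost:8001/health`) before stopping B and investigating.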


See also: blue-green deployment and canary rollout.


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
