Upgrading the model behind your production API – Llama 3.1 to 3.3, Qwen 2.5 14B to 32B, or a custom fine-tune rollout – should not drop customer requests. On our dedicated GPU hosting, the common pattern is two concurrent vLLM replicas behind a load balancer.
Pattern
- Load balancer fronting vLLM replica A on port 8001
- Start replica B on port 8002 loading the new model
- Wait for B’s health check to pass
- Update load balancer to route to B
- Drain A (wait for in-flight requests to complete)
- Stop A
Clients see zero dropped requests. In-flight requests on A finish on A; new requests go to B.
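The load balancer can be a minimal nginx reverse proxy. A sketch of what the config might look like – the path, upstream name, and listen port are illustrative, not prescribed:

```nginx
# /etc/nginx/conf.d/llm.conf -- path and names are illustrative
upstream llm_backend {
    server 127.0.0.1:8001;   # replica A; the cutover rewrites this to 8002
}

server {
    listen 8000;
    location / {
        proxy_pass http://llm_backend;
        proxy_read_timeout 300s;   # long generations need a generous timeout
    }
}
```

Because the upstream holds a single `server` line, the cutover is a one-line edit followed by `nginx -s reload`, which itself is graceful: old worker processes finish their in-flight connections before exiting.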
GPU Budget
For a few minutes you are running both replicas. If both fit on the same GPU (smaller model) you can do this on one card. If not, you need a second GPU briefly. On a multi-GPU server this is trivial – dedicate one card to A and one to B. On a single-GPU server that cannot hold both, accept an in-place swap with brief downtime (typically 10-30s) instead.
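The fit check is simple arithmetic over each replica's memory footprint (weights plus KV-cache budget). A sketch with placeholder numbers – an 80 GiB card and two ~28 GiB replicas are assumptions, not measurements:

```shell
#!/bin/bash
# Placeholder sizes in GiB -- substitute your card and your models
GPU_MEM=80
REPLICA_A=28   # weights + KV-cache budget of the old model
REPLICA_B=28   # same for the new model

if [ $((REPLICA_A + REPLICA_B)) -le "$GPU_MEM" ]; then
    echo "both replicas fit on one card"
else
    echo "second GPU (or brief downtime) required"
fi
```

In practice, read the real footprint from `nvidia-smi` while A is serving, and remember vLLM pre-allocates KV cache up to its `gpu-memory-utilization` fraction, so the number to compare is the allocation, not just the weight size.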
Script
#!/bin/bash
set -euo pipefail

# PID of replica A -- assumed to have been started earlier as
# "vllm serve old-model --port 8001"; adjust the pattern to your setup
OLD_PID=$(pgrep -f 'vllm serve .*--port 8001')

# Start B on the second GPU
CUDA_VISIBLE_DEVICES=1 vllm serve new-model --port 8002 &
NEW_PID=$!

# Wait for B to be healthy
until curl -sf http://localhost:8002/health >/dev/null; do sleep 2; done

# Switch the nginx upstream from A to B
sed -i 's|server 127.0.0.1:8001|server 127.0.0.1:8002|' /etc/nginx/conf.d/llm.conf
nginx -s reload

# Drain A: give in-flight requests time to complete
sleep 60

# Stop A
kill $OLD_PID
wait $OLD_PID 2>/dev/null || true
echo "Swap complete. New PID: $NEW_PID"
Rollback
If B shows issues after cutover, flip the nginx upstream back to A, which keeps running until the drain step completes. To make quick rollback possible, extend the drain window and leave A running for 5-10 minutes post-cutover before stopping it.
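The rollback is the cutover `sed` in reverse. This sketch rehearses the edit on a scratch copy rather than the live config (the real path, `/etc/nginx/conf.d/llm.conf`, is taken from the script above):

```shell
#!/bin/bash
set -euo pipefail

# Rehearse the rollback edit on a scratch copy of the upstream config
CONF=$(mktemp)
echo 'server 127.0.0.1:8002;' > "$CONF"

# Reverse of the cutover: point the upstream back at replica A on 8001
sed -i 's|server 127.0.0.1:8002|server 127.0.0.1:8001|' "$CONF"
cat "$CONF"   # server 127.0.0.1:8001;
rm -f "$CONF"
```

Against the live config, run the same `sed -i` on `/etc/nginx/conf.d/llm.conf` followed by `nginx -s reload`; as with the cutover, the reload drains connections gracefully.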
See also: blue-green deployment and canary rollout.