Loading a large language model takes minutes — minutes where your inference API returns errors or queues requests indefinitely. Blue-green deployment eliminates that downtime by running two identical environments and switching traffic only after the new model is fully loaded and validated on your dedicated GPU server.
How Blue-Green Works for AI
Standard blue-green deployment runs two production environments. One serves live traffic (blue), while the other sits idle or receives the next deployment (green). For AI inference, the pattern addresses the model loading problem: GPU models take 30 seconds to several minutes to initialise, and you cannot serve requests during that window.
| Phase | Blue Environment | Green Environment | Traffic |
|---|---|---|---|
| Steady state | Serving (active) | Idle or previous version | 100% blue |
| Deploy | Serving (active) | Loading new model | 100% blue |
| Validate | Serving (active) | Ready, running health checks | 100% blue |
| Switch | Draining | Serving (active) | 100% green |
| Cleanup | Stopped or standby | Serving (active) | 100% green |
The key advantage: live traffic never hits a loading model. Users see zero interruption during the switch.
Docker Compose Setup
Run blue and green as separate containers on the same GPU server. Nginx routes traffic to whichever environment is active.
```yaml
# docker-compose.yml
version: "3.8"
services:
  vllm-blue:
    image: vllm/vllm-openai:latest
    container_name: vllm-blue
    command:
      - "--model"
      - "${BLUE_MODEL:-meta-llama/Llama-3.1-8B-Instruct}"
      - "--port"
      - "8000"
      - "--gpu-memory-utilization"
      - "0.45"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
    networks:
      - ai-net

  vllm-green:
    image: vllm/vllm-openai:latest
    container_name: vllm-green
    command:
      - "--model"
      - "${GREEN_MODEL:-meta-llama/Llama-3.1-8B-Instruct}"
      - "--port"
      - "8000"
      - "--gpu-memory-utilization"
      - "0.45"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
    networks:
      - ai-net

  nginx:
    image: nginx:latest
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf
    ports:
      - "80:80"
    depends_on:
      - vllm-blue
      - vllm-green
    networks:
      - ai-net

networks:
  ai-net:
```
Note that `--gpu-memory-utilization` is set to 0.45 for each container, reserving just under half of the GPU's VRAM per environment. On a single GPU, this limits the model size you can serve. For larger models, use a multi-GPU server or deploy blue and green on separate GPUs. See our vLLM vs Ollama comparison for backend selection.
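A quick back-of-the-envelope check shows why the 0.45 split constrains model size. The sketch below assumes a 48 GB card and FP16 weights at roughly 2 bytes per parameter; the function name and numbers are illustrative, and the real budget is tighter still because vLLM also needs room for the KV cache.

```python
def fits_on_gpu(params_billions, gpu_vram_gb=48, utilization=0.45):
    """True if FP16 weights alone fit in the per-environment VRAM slice.

    Optimistic estimate: ignores KV cache and activation overhead, so a
    model that barely passes here may still fail to load in practice.
    """
    weights_gb = params_billions * 2        # ~2 bytes per parameter in FP16
    budget_gb = gpu_vram_gb * utilization   # slice reserved for one container
    return weights_gb < budget_gb

print(fits_on_gpu(8))    # 16 GB of weights vs a 21.6 GB slice -> True
print(fits_on_gpu(70))   # 140 GB of weights -> False, needs multiple GPUs
```

On a 24 GB card the same 8B model no longer fits in a 0.45 slice (16 GB vs 10.8 GB), which is exactly the case where separate GPUs per environment become necessary.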
Nginx Traffic Routing
Nginx switches traffic between blue and green using an upstream configuration that you update during deployment.
```nginx
# nginx.conf
upstream ai_backend {
    server vllm-blue:8000;
    # server vllm-green:8000;  # Uncomment (and comment the line above) to switch
}

server {
    listen 80;
    server_name api.yourdomain.com;

    location /v1 {
        proxy_pass http://ai_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 120s;
        proxy_buffering off;  # Required for streaming responses
    }

    location /health {
        proxy_pass http://ai_backend/health;
    }
}
```
Switching traffic is a config update plus an Nginx reload, a sub-second operation. The reload is graceful: old worker processes finish their in-flight requests before exiting, so no connections are dropped.
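One gotcha when editing the config by hand: `sed -i` writes a new file and replaces the inode, but Docker bind-mounts this single file by inode, so the container would keep reading the old config. A sketch of an inode-preserving edit (shown here against a scratch copy of the upstream block; in production you would operate on the real nginx.conf):

```shell
# Scratch copy of the upstream block for demonstration purposes.
printf 'upstream ai_backend {\n    server vllm-blue:8000;\n}\n' > nginx.conf

# Two-step edit: sed writes to a temp file, then cat truncates and rewrites
# the original in place, which keeps the inode the bind mount points at.
sed 's/server vllm-blue:8000;/server vllm-green:8000;/' nginx.conf > nginx.conf.tmp
cat nginx.conf.tmp > nginx.conf
rm nginx.conf.tmp

grep 'server' nginx.conf   # upstream now points at green
```

Afterwards, validate and reload inside the container with `docker compose exec nginx nginx -t` followed by `docker compose exec nginx nginx -s reload`.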
Automated Deployment Script
Automate the full blue-green cycle: deploy to the inactive environment, validate, switch traffic, and optionally roll back.
```bash
#!/bin/bash
# scripts/blue_green_deploy.sh
set -e

NEW_MODEL="$1"
COMPOSE_FILE="docker-compose.yml"
NGINX_CONF="./nginx.conf"

# Rewrite the upstream host without replacing the file's inode. (sed -i
# writes a new file, which would break the single-file bind mount: the
# nginx container would keep reading the old config.)
switch_upstream() {
  sed "s/vllm-${1}:8000/vllm-${2}:8000/" "$NGINX_CONF" > "${NGINX_CONF}.tmp"
  cat "${NGINX_CONF}.tmp" > "$NGINX_CONF"
  rm "${NGINX_CONF}.tmp"
  docker compose -f "$COMPOSE_FILE" exec nginx nginx -s reload
}

# Determine which environment is active (the uncommented upstream line)
if grep -Eq '^[[:space:]]*server vllm-blue:8000;' "$NGINX_CONF"; then
  ACTIVE="blue"
  INACTIVE="green"
else
  ACTIVE="green"
  INACTIVE="blue"
fi
echo "Active: $ACTIVE | Deploying to: $INACTIVE"

# Deploy new model to inactive environment
export "${INACTIVE^^}_MODEL=$NEW_MODEL"
docker compose -f "$COMPOSE_FILE" up -d "vllm-${INACTIVE}"

# Wait for model to load and become healthy
echo "Waiting for $INACTIVE to become healthy..."
for i in $(seq 1 60); do
  if docker inspect "vllm-${INACTIVE}" --format='{{.State.Health.Status}}' 2>/dev/null | grep -q "healthy"; then
    echo "$INACTIVE is healthy after $((i * 10)) seconds"
    break
  fi
  if [ "$i" -eq 60 ]; then
    echo "ERROR: $INACTIVE failed to become healthy"
    docker compose -f "$COMPOSE_FILE" stop "vllm-${INACTIVE}"
    exit 1
  fi
  sleep 10
done

# Validate with an inference test. The vLLM containers publish no host
# ports, so reach the inactive one via its bridge-network IP.
echo "Running validation..."
INACTIVE_IP=$(docker inspect "vllm-${INACTIVE}" \
  --format='{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}')
if ! python scripts/validate_model.py --endpoint "http://${INACTIVE_IP}:8000/v1"; then
  echo "Validation failed, aborting deployment"
  docker compose -f "$COMPOSE_FILE" stop "vllm-${INACTIVE}"
  exit 1
fi

# Switch traffic
echo "Switching traffic to $INACTIVE..."
switch_upstream "$ACTIVE" "$INACTIVE"
echo "Traffic switched to $INACTIVE"

# Smoke test against live endpoint
sleep 3
if ! python scripts/smoke_test.py --endpoint "http://localhost:80/v1"; then
  echo "Smoke test failed! Rolling back..."
  switch_upstream "$INACTIVE" "$ACTIVE"
  echo "Rolled back to $ACTIVE"
  exit 1
fi

# Optionally stop old environment to reclaim GPU memory
echo "Stopping $ACTIVE environment..."
docker compose -f "$COMPOSE_FILE" stop "vllm-${ACTIVE}"
echo "Deployment complete. $INACTIVE is now active with $NEW_MODEL"
```
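The `validate_model.py` the script calls is not shown in this guide; a minimal sketch follows, assuming an OpenAI-compatible vLLM endpoint and using only the standard library. It discovers the served model name from `/v1/models`, runs one chat completion, and exits non-zero on failure so the deploy script's error handling kicks in. (`smoke_test.py` can follow the same shape against the live endpoint.)

```python
#!/usr/bin/env python3
"""Sketch of a validate_model.py for the deploy script above (hypothetical
implementation, not the guide's actual file)."""
import argparse
import json
import sys
import urllib.request


def first_model_id(models_payload: dict) -> str:
    """Pick the served model id from a /v1/models response."""
    return models_payload["data"][0]["id"]


def validate(endpoint: str, timeout: float = 120.0) -> bool:
    """Run one chat completion and check a non-empty reply comes back."""
    with urllib.request.urlopen(f"{endpoint}/models", timeout=timeout) as resp:
        model = first_model_id(json.load(resp))
    request = urllib.request.Request(
        f"{endpoint}/chat/completions",
        data=json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": "Reply with the word OK."}],
            "max_tokens": 5,
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return bool(body["choices"][0]["message"]["content"].strip())


if __name__ == "__main__" and len(sys.argv) > 1:  # no-op when run without args
    parser = argparse.ArgumentParser()
    parser.add_argument("--endpoint", required=True)
    sys.exit(0 if validate(parser.parse_args().endpoint) else 1)
```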
Multi-GPU Blue-Green
On servers with multiple GPUs, assign each environment to a dedicated device. This allows full VRAM utilisation per model and avoids the memory split required on single-GPU setups.
```yaml
# Multi-GPU assignment
services:
  vllm-blue:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
  vllm-green:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
```

No `CUDA_VISIBLE_DEVICES` override is needed: `device_ids` already isolates the GPU, and inside each container the assigned card is renumbered as device 0. Setting `CUDA_VISIBLE_DEVICES=1` inside vllm-green would in fact break it, since the container sees no device with that index.
With dedicated GPUs, both environments can run simultaneously at full capacity, enabling instant traffic switching and even canary deployments where a fraction of traffic tests the new model before full cutover.
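A canary split can be expressed directly in the Nginx upstream with server weights. A sketch (the 9:1 ratio is an arbitrary example) that sends roughly 10% of requests to green:

```nginx
# Hypothetical canary stage: ~10% of traffic tests the new model.
upstream ai_backend {
    server vllm-blue:8000 weight=9;
    server vllm-green:8000 weight=1;
}
```

Reload Nginx after each weight change; once green looks healthy in your metrics, shift it to 100% and stop blue.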
Production Integration
Wire blue-green deployment into your CI/CD pipeline so every model update triggers an automated deployment cycle. Track active versions with the model versioning system. Monitor both environments with Prometheus and Grafana to compare performance before and after switching.
Route traffic through an API gateway for additional control over the switch. Log inference requests through the ELK stack to verify the cutover is clean. The self-hosting guide covers base infrastructure, and our tutorials section has additional deployment patterns for vLLM production workloads.
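As a sketch of the CI/CD wiring, a manually triggered pipeline job could invoke the deploy script on the GPU server. GitHub Actions syntax is shown here; the self-hosted runner label and script path are assumptions:

```yaml
# .github/workflows/deploy-model.yml (hypothetical)
name: deploy-model
on:
  workflow_dispatch:
    inputs:
      model:
        description: "Hugging Face model ID to deploy"
        required: true
jobs:
  blue-green:
    runs-on: [self-hosted, gpu]   # assumed runner installed on the GPU server
    steps:
      - uses: actions/checkout@v4
      - name: Run blue-green deploy
        run: ./scripts/blue_green_deploy.sh "${{ github.event.inputs.model }}"
```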
Zero-Downtime AI Deployments on Dedicated GPUs
Run blue-green deployments on bare-metal GPU servers. Switch models without dropping a single inference request.