
Blue-Green Deployment for AI Models

A complete guide to implementing blue-green deployment for AI models on GPU servers, covering zero-downtime switching, traffic routing, health validation, automated rollback, and production cutover procedures.

Loading a large language model takes minutes — minutes where your inference API returns errors or queues requests indefinitely. Blue-green deployment eliminates that downtime by running two identical environments and switching traffic only after the new model is fully loaded and validated on your dedicated GPU server.

How Blue-Green Works for AI

Standard blue-green deployment runs two production environments. One serves live traffic (blue), while the other sits idle or receives the next deployment (green). For AI inference, the pattern addresses the model loading problem: GPU models take 30 seconds to several minutes to initialise, and you cannot serve requests during that window.

Phase        | Blue Environment   | Green Environment            | Traffic
-------------|--------------------|------------------------------|-----------
Steady state | Serving (active)   | Idle or previous version     | 100% blue
Deploy       | Serving (active)   | Loading new model            | 100% blue
Validate     | Serving (active)   | Ready, running health checks | 100% blue
Switch       | Draining           | Serving (active)             | 100% green
Cleanup      | Stopped or standby | Serving (active)             | 100% green

The key advantage: live traffic never hits a loading model. Users see zero interruption during the switch.

Docker Compose Setup

Run blue and green as separate containers on the same GPU server. Nginx routes traffic to whichever environment is active.

# docker-compose.yml
version: "3.8"
services:
  vllm-blue:
    image: vllm/vllm-openai:latest
    container_name: vllm-blue
    command:
      - "--model"
      - "${BLUE_MODEL:-meta-llama/Llama-3.1-8B-Instruct}"
      - "--port"
      - "8000"
      - "--gpu-memory-utilization"
      - "0.45"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
    networks:
      - ai-net

  vllm-green:
    image: vllm/vllm-openai:latest
    container_name: vllm-green
    command:
      - "--model"
      - "${GREEN_MODEL:-meta-llama/Llama-3.1-8B-Instruct}"
      - "--port"
      - "8000"
      - "--gpu-memory-utilization"
      - "0.45"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
    networks:
      - ai-net

  nginx:
    image: nginx:latest
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf
    ports:
      - "80:80"
    depends_on:
      - vllm-blue
      - vllm-green
    networks:
      - ai-net

networks:
  ai-net:

Note that --gpu-memory-utilization is set to 0.45 for each container, reserving just under half of the GPU's VRAM per environment. On a single GPU, this limits model size. For larger models, use a multi-GPU server or deploy blue and green on separate GPUs. See our vLLM vs Ollama comparison for backend selection.
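To see why the split constrains model size, a back-of-envelope check helps. The figures below are illustrative arithmetic only: real usage also includes KV cache, activations, and CUDA overhead, so treat the weight estimate as a lower bound.

```python
def per_env_vram_gb(total_gb: float, utilization: float = 0.45) -> float:
    """VRAM budget for one environment under the fixed utilization split."""
    return total_gb * utilization


def weight_footprint_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Rough model-weight footprint: parameters x bytes per parameter.

    FP16/BF16 weights are 2 bytes per parameter; this excludes
    KV cache and activation memory, so real usage is higher.
    """
    return params_billions * bytes_per_param


budget = per_env_vram_gb(24.0)       # 24 GB card -> 10.8 GB per environment
weights = weight_footprint_gb(8.0)   # 8B parameters in FP16 -> 16 GB
print(f"budget {budget:.1f} GB, 8B FP16 weights {weights:.1f} GB")
```

An 8B FP16 model does not fit in the half-card budget of a 24 GB GPU, which is exactly the situation where quantized weights or a multi-GPU layout becomes necessary.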

Nginx Traffic Routing

Nginx switches traffic between blue and green using an upstream configuration that you update during deployment.

# nginx.conf
upstream ai_backend {
    server vllm-blue:8000;
    # server vllm-green:8000;  # Uncomment to switch
}

server {
    listen 80;
    server_name api.yourdomain.com;

    location /v1 {
        proxy_pass http://ai_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 120s;
        proxy_buffering off;  # Required for streaming
    }

    location /health {
        proxy_pass http://ai_backend/health;
    }
}

Switching traffic is a config update and Nginx reload — a sub-second operation. No connections are dropped during the reload.
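After a switch, the upstream block simply has the roles reversed. Keeping the inactive entry commented out makes the next switch, or an emergency rollback, a two-line edit:

```nginx
# nginx.conf after switching to green
upstream ai_backend {
    # server vllm-blue:8000;  # Previous version, kept for rollback
    server vllm-green:8000;
}
```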

Automated Deployment Script

Automate the full blue-green cycle: deploy to the inactive environment, validate, switch traffic, and optionally roll back.

#!/bin/bash
# scripts/blue_green_deploy.sh
set -e

NEW_MODEL="$1"
if [ -z "$NEW_MODEL" ]; then
    echo "Usage: $0 <huggingface-model-id>" >&2
    exit 1
fi
COMPOSE_FILE="docker-compose.yml"
NGINX_CONF="./nginx.conf"

# Determine which environment is active: the uncommented upstream entry
if grep -Eq '^[[:space:]]*server vllm-blue:8000;' "$NGINX_CONF"; then
    ACTIVE="blue"
    INACTIVE="green"
else
    ACTIVE="green"
    INACTIVE="blue"
fi

echo "Active: $ACTIVE | Deploying to: $INACTIVE"

# Deploy new model to inactive environment
export "${INACTIVE^^}_MODEL=$NEW_MODEL"
docker compose -f "$COMPOSE_FILE" up -d "vllm-${INACTIVE}"

# Wait for model to load and become healthy
echo "Waiting for $INACTIVE to become healthy..."
for i in $(seq 1 60); do
    if docker inspect "vllm-${INACTIVE}" --format='{{.State.Health.Status}}' 2>/dev/null | grep -q "healthy"; then
        echo "$INACTIVE is healthy after $((i * 10)) seconds"
        break
    fi
    if [ "$i" -eq 60 ]; then
        echo "ERROR: $INACTIVE failed to become healthy"
        docker compose -f "$COMPOSE_FILE" stop "vllm-${INACTIVE}"
        exit 1
    fi
    sleep 10
done

# Validate with an inference test. The vLLM containers do not publish
# host ports, so resolve the container's IP on the compose network.
echo "Running validation..."
INACTIVE_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "vllm-${INACTIVE}")
if ! python scripts/validate_model.py --endpoint "http://${INACTIVE_IP}:8000/v1"; then
    echo "Validation failed, aborting deployment"
    docker compose -f "$COMPOSE_FILE" stop "vllm-${INACTIVE}"
    exit 1
fi

# Switch traffic: swap the active hostname in the upstream block.
# Write the file in place via cat rather than sed -i: sed -i replaces
# the file (new inode), and a Docker single-file bind mount tracks the
# old inode, so the container would never see the change.
echo "Switching traffic to $INACTIVE..."
sed "s/vllm-${ACTIVE}:8000/vllm-${INACTIVE}:8000/" "$NGINX_CONF" > "${NGINX_CONF}.tmp"
cat "${NGINX_CONF}.tmp" > "$NGINX_CONF" && rm "${NGINX_CONF}.tmp"
docker compose -f "$COMPOSE_FILE" exec nginx nginx -t
docker compose -f "$COMPOSE_FILE" exec nginx nginx -s reload

echo "Traffic switched to $INACTIVE"

# Smoke test against live endpoint
sleep 3
if ! python scripts/smoke_test.py --endpoint "http://localhost:80/v1"; then
    echo "Smoke test failed! Rolling back..."
    sed "s/vllm-${INACTIVE}:8000/vllm-${ACTIVE}:8000/" "$NGINX_CONF" > "${NGINX_CONF}.tmp"
    cat "${NGINX_CONF}.tmp" > "$NGINX_CONF" && rm "${NGINX_CONF}.tmp"
    docker compose -f "$COMPOSE_FILE" exec nginx nginx -s reload
    echo "Rolled back to $ACTIVE"
    exit 1
fi

# Optionally stop old environment to reclaim GPU memory
echo "Stopping $ACTIVE environment..."
docker compose -f "$COMPOSE_FILE" stop "vllm-${ACTIVE}"
echo "Deployment complete. $INACTIVE is now active with $NEW_MODEL"
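The deploy script calls scripts/validate_model.py, which is not shown above. A minimal sketch of what such a probe could look like, using only the standard library against vLLM's OpenAI-compatible API (the /v1/models and /v1/chat/completions endpoints are standard; the acceptance check itself is an assumption to adapt to your own criteria):

```python
"""Sketch of a validation probe for an OpenAI-compatible endpoint.

Queries /models for the served model id, sends one short chat
completion, and reports failure if anything looks wrong.
"""
import json
import sys
import urllib.request


def validate(endpoint: str, timeout: float = 60.0) -> bool:
    """Return True if the endpoint serves a well-formed completion."""
    try:
        # vLLM's OpenAI server lists the loaded model under /v1/models
        with urllib.request.urlopen(f"{endpoint}/models", timeout=timeout) as resp:
            model_id = json.load(resp)["data"][0]["id"]

        payload = json.dumps({
            "model": model_id,
            "messages": [{"role": "user", "content": "Reply with the word OK."}],
            "max_tokens": 8,
        }).encode()
        req = urllib.request.Request(
            f"{endpoint}/chat/completions",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = json.load(resp)
    except Exception as exc:
        print(f"Validation request failed: {exc}", file=sys.stderr)
        return False

    # A healthy deployment returns at least one non-empty choice
    choices = body.get("choices", [])
    return bool(choices) and bool(choices[0].get("message", {}).get("content"))
```

To match the deploy script's contract, wrap validate() in an argparse entry point that accepts --endpoint and exits non-zero on failure.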

Multi-GPU Blue-Green

On servers with multiple GPUs, assign each environment to a dedicated device. This allows full VRAM utilisation per model and avoids the memory split required on single-GPU setups.

# Multi-GPU assignment
services:
  vllm-blue:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0

  vllm-green:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=1

With dedicated GPUs, both environments can run simultaneously at full capacity, enabling instant traffic switching and even canary deployments where a fraction of traffic tests the new model before full cutover.
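The canary variant mentioned above maps directly onto Nginx upstream weights. A sketch that sends roughly 10% of requests to the new model (the split ratio is an example, not a recommendation):

```nginx
# Canary: ~90/10 split between current and candidate
upstream ai_backend {
    server vllm-blue:8000 weight=9;
    server vllm-green:8000 weight=1;
}
```

Increase the green weight as confidence grows, then remove the blue entry for full cutover.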

Production Integration

Wire blue-green deployment into your CI/CD pipeline so every model update triggers an automated deployment cycle. Track active versions with the model versioning system. Monitor both environments with Prometheus and Grafana to compare performance before and after switching.
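vLLM's OpenAI-compatible server exposes Prometheus metrics at /metrics, so both environments can be scraped side by side for before/after comparison. A minimal fragment (the job name is an assumption):

```yaml
# prometheus.yml
scrape_configs:
  - job_name: "vllm-blue-green"
    metrics_path: /metrics
    static_configs:
      - targets: ["vllm-blue:8000", "vllm-green:8000"]
```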

Route traffic through an API gateway for additional control over the switch. Log inference requests through the ELK stack to verify the cutover is clean. The self-hosting guide covers base infrastructure, and our tutorials section has additional deployment patterns for vLLM production workloads.

Zero-Downtime AI Deployments on Dedicated GPUs

Run blue-green deployments on bare-metal GPU servers. Switch models without dropping a single inference request.

Browse GPU Servers

