Blue-Green Deployment for AI Services

Zero-downtime deploys for vLLM and AI services using the blue-green pattern. Specific gotchas for stateful inference.

Tutorials | May 6, 2026 | 2 min read | gigagpu

Blue-green deployment for AI services follows the standard pattern with three twists: long cold-start times mean you can't skip the warm-up, in-flight requests need draining, and stateful caches (prefix cache, semantic cache) may need invalidation depending on what changed.

TL;DR

Run two parallel deployments (blue / green); the load balancer points at one. Deploy the new version to the idle pool, warm it up, run smoke tests + eval harness, shift load balancer traffic to it, then drain the old pool. Cold start is ~30-90s per replica; budget for it. The prefix cache resets per process; the semantic cache survives if it runs as a separate service.
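The cold-start budget can be enforced with a readiness gate. The sketch below polls a probe until it succeeds or a time budget runs out. It assumes each replica exposes an HTTP health endpoint (vLLM's OpenAI-compatible server serves `/health`, but verify this for your version). The names `http_ok` and `wait_until_ready` and the 180-second default budget are illustrative assumptions, not part of any library.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError are OSError subclasses).
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    budget_s: float = 180.0,   # assumption: 2x the ~90s worst-case cold start
    interval_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or the time budget is spent."""
    deadline = clock() + budget_s
    while True:
        if probe():
            return True
        if clock() + interval_s > deadline:
            return False  # budget exhausted: do not shift traffic
        sleep(interval_s)
```

A call might look like `wait_until_ready(lambda: http_ok("http://green-1:8000/health"))`, where the host name is hypothetical. A passing health check only shows the server is up; sending one real inference request afterwards covers the first-request warm-up.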

Pattern

Two replica pools: blue (current production) + green (idle / target)
Deploy new version to green pool
Wait for green to fully start (vLLM cold-start ~30-90s)
Run smoke test + eval harness against green
Send 1% production traffic to green; monitor for 5-10 minutes
Ramp to 100%: 1% → 25% → 75% → 100% over 30-60 minutes
Drain in-flight requests on blue (allow ~60 seconds; longer for streaming responses)
Keep the blue pool warm for instant rollback, or decommission it with a graceful shutdown
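The ramp in the steps above can be sketched as a loop that raises green's traffic share, holds at each step while checking health, and sends everything back to blue on the first failed check. This is a minimal sketch: `set_green_percent` and `is_healthy` are hypothetical callables standing in for your load balancer API and your monitoring, and the 10-minute hold per step is an assumption chosen to fit the 30-60 minute window.

```python
import time
from typing import Callable, Sequence


def ramp_traffic(
    set_green_percent: Callable[[int], None],  # hypothetical load balancer hook
    is_healthy: Callable[[], bool],            # hypothetical monitoring check
    steps: Sequence[int] = (1, 25, 75, 100),
    hold_s: float = 600.0,                     # assumption: 10 minutes per step
    check_every_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Shift traffic to green step by step; return False if rolled back."""
    for percent in steps:
        set_green_percent(percent)
        waited = 0.0
        while waited < hold_s:
            sleep(check_every_s)
            waited += check_every_s
            if not is_healthy():
                # Blue is still serving, so rollback is a weight change.
                set_green_percent(0)
                return False
    return True
```

With the defaults, four steps of 10 minutes each give a 40-minute rollout. Blue should only be drained after this function returns True.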

AI-specific gotchas

Cold start: ~30-90s per replica for vLLM start + first-request warm-up. Don't flip until ready.
Prefix cache reset: each new process starts with a cold prefix cache. The first requests after the flip may be slower until it warms.
Semantic cache survival: if it runs in a separate Redis / Qdrant process, it survives the deploy. If it is in-process, it resets.
Long-running streaming requests: may take 60+ seconds to drain. Set TimeoutStopSec=120 or longer in the systemd unit so the process is not killed mid-stream.
Model version coordination: prompt templates may need to change with the model version. Deploy them as a unit.
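Draining, as described for streaming requests above, amounts to refusing new requests and waiting, up to a deadline, for in-flight ones to finish. Below is a minimal asyncio sketch, assuming the service counts its own requests. `InFlightTracker` is a hypothetical helper, not a vLLM or systemd API.

```python
import asyncio


class InFlightTracker:
    """Counts active requests so shutdown can wait for them to finish."""

    def __init__(self) -> None:
        # Create the tracker inside the running event loop.
        self._count = 0
        self._idle = asyncio.Event()
        self._idle.set()
        self.accepting = True

    def start_request(self) -> bool:
        """Register a request; returns False once draining has begun."""
        if not self.accepting:
            return False
        self._count += 1
        self._idle.clear()
        return True

    def finish_request(self) -> None:
        """Mark one request (for example, a finished stream) as done."""
        self._count -= 1
        if self._count == 0:
            self._idle.set()

    async def drain(self, timeout_s: float) -> bool:
        """Stop accepting work; return True if all requests finished in time."""
        self.accepting = False
        try:
            await asyncio.wait_for(self._idle.wait(), timeout=timeout_s)
            return True
        except asyncio.TimeoutError:
            return False
```

The drain timeout should stay below TimeoutStopSec (for example, `drain(100)` with TimeoutStopSec=120) so the process can exit on its own before the service manager force-kills it.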

Rollback

Blue-green's killer feature is instant rollback: flip the load balancer back while the blue pool is still warm. Full rollback takes ~30 seconds if blue has not been decommissioned. After decommission, rollback follows the standard re-deploy path (~5 minutes).

Best practice: keep the blue pool warm for 24-48 hours after green takes 100% of traffic. The cost is a duplicate pool's worth of GPU capacity for one to two days; the value is instant rollback insurance.
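The cost side of that trade-off is simple arithmetic: GPUs in the blue pool, times the hourly rate, times the hours kept warm. A small sketch follows; the function name is illustrative and the pool size and hourly rate in the usage note are made-up placeholders, not real pricing.

```python
def warm_standby_cost(gpu_count: int, hourly_rate_per_gpu: float, hours: float) -> float:
    """Cost of keeping an idle pool running for rollback insurance."""
    if gpu_count < 0 or hourly_rate_per_gpu < 0 or hours < 0:
        raise ValueError("inputs must be non-negative")
    return gpu_count * hourly_rate_per_gpu * hours
```

For a hypothetical 2-GPU pool at 0.50 per GPU-hour, `warm_standby_cost(2, 0.50, 24)` gives 24.0 and `warm_standby_cost(2, 0.50, 48)` gives 48.0, in whatever currency the rate is quoted.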

Verdict

For production AI, blue-green is the right deploy pattern. The cold-start cost is real but manageable. Always run the eval harness on the green pool before flipping. Keep the blue pool warm long enough that rollback is instant if needed.

Bottom line

Standard blue-green + AI-specific timing. See graceful shutdown.

gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.