
Blue-Green Deployment for AI Services

Zero-downtime deploys for vLLM and AI services using the blue-green pattern. Specific gotchas for stateful inference.

Blue-green deployment for AI services follows the standard pattern with three twists: long cold-start times mean you can't skip the warm-up, in-flight requests need draining, and stateful caches (prefix cache, semantic cache) may need invalidation depending on what changed.

TL;DR

Run two parallel deployments (blue / green); the load balancer points at one. Deploy the new version to the idle pool, warm it up, run smoke tests plus the eval harness, flip the load balancer, then drain the old pool. Cold start is ~30-90s per replica; budget for it. The prefix cache resets per process; a semantic cache survives if it runs in a separate process.

Pattern

  1. Two replica pools: blue (current production) + green (idle / target)
  2. Deploy new version to green pool
  3. Wait for green to fully start (vLLM cold-start ~30-90s)
  4. Run smoke test + eval harness against green
  5. Send 1% production traffic to green; monitor for 5-10 minutes
  6. Ramp to 100%: 1% → 25% → 75% → 100% over 30-60 minutes
  7. Drain in-flight requests on blue (graceful shutdown ~60 seconds)
  8. Decommission blue pool (or keep warm for instant rollback)
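Steps 3-6 can be sketched as a small orchestration helper. This is a minimal sketch, not a full controller: the probe callable, the 120s timeout, and the 10-minute hold per ramp step are assumptions you'd tune to your own stack.

```python
import time
from typing import Callable, Iterator, List, Tuple

RAMP_STEPS: List[int] = [1, 25, 75, 100]  # percent of traffic sent to green

def wait_until_ready(probe: Callable[[], bool],
                     timeout_s: float = 120.0,
                     interval_s: float = 5.0) -> bool:
    """Poll a readiness probe until it returns True or the timeout expires.

    vLLM cold start is roughly 30-90s, so a 120s default leaves headroom.
    The probe is whatever checks your stack exposes (e.g. a health
    endpoint plus a first warm-up completion)."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)

def ramp_plan(steps: List[int] = RAMP_STEPS,
              hold_s: int = 600) -> Iterator[Tuple[int, int]]:
    """Yield (green_percent, hold_seconds) pairs for the traffic ramp.

    Each step holds long enough to watch error rates and latency;
    the final 100% step needs no hold before draining blue."""
    for pct in steps:
        yield pct, (0 if pct == 100 else hold_s)
```

A real cutover script would call `wait_until_ready` against green, run the eval harness, then walk `ramp_plan` while watching dashboards at each hold.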

AI-specific gotchas

  • Cold start: ~30-90s per replica for vLLM start + first-request warm-up. Don't flip until ready.
  • Prefix cache reset: each new process starts cold. First-N requests after flip may be slower until cache warms.
  • Semantic cache survival: if running in separate Redis / Qdrant process, survives. If in-process, resets.
  • Long-running streaming requests: may take 60+ seconds to drain. Set TimeoutStopSec=120 or longer.
  • Model version coordination: prompt templates may need to be updated alongside the model version; deploy them as a unit.
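The drain timeout from the streaming gotcha above can be expressed as a systemd drop-in. The unit name and drop-in path here are assumptions; adjust them to however your inference server is actually run.

```ini
# /etc/systemd/system/vllm.service.d/override.conf  (hypothetical unit name)
[Service]
# systemd sends SIGTERM on stop; give in-flight streaming requests up to
# 120s to drain before it escalates to SIGKILL. Raise this if your longest
# generations routinely exceed two minutes.
KillSignal=SIGTERM
TimeoutStopSec=120
```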

Rollback

Blue-green's killer feature is instant rollback: flip the load balancer back, and the blue pool is still warm. Expect roughly 30 seconds to full rollback if blue hasn't yet been decommissioned; after decommissioning, you're on the standard re-deploy path (~5 minutes).

Best practice: keep blue pool warm for 24-48 hours after green takes 100% traffic. The cost is one duplicated GPU's worth of capacity for two days; the value is instant rollback insurance.
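One way to make both the flip and the flip-back a single atomic step is a state-file pointer that your load-balancer config is templated from. The path and file format below are assumptions; this is a minimal sketch of the idea, not a finished tool.

```python
import json
import os
import pathlib
import tempfile

# Hypothetical state file a config-templating hook (or the load balancer
# itself) watches to decide which pool receives production traffic.
STATE_FILE = pathlib.Path("/var/run/deploy/active_pool.json")

def flip(active: str, state_file: pathlib.Path = STATE_FILE) -> None:
    """Atomically record which pool should take production traffic.

    Writing a temp file and renaming it means readers never observe a
    half-written state, so rollback is a single call: flip("blue")."""
    if active not in ("blue", "green"):
        raise ValueError(f"unknown pool: {active}")
    fd, tmp = tempfile.mkstemp(dir=state_file.parent)
    with os.fdopen(fd, "w") as f:
        json.dump({"active": active}, f)
    os.replace(tmp, state_file)  # atomic rename on POSIX

def active_pool(state_file: pathlib.Path = STATE_FILE) -> str:
    """Return the pool currently marked active."""
    return json.loads(state_file.read_text())["active"]
```

Whatever mechanism you use, the point is the same: the cutover and the rollback should be the identical one-step operation, just with the pool names swapped.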

Verdict

For production AI, blue-green is the right deploy pattern. The cold-start cost is real but manageable. Always run eval harness on the green pool before flipping. The blue pool stays warm long enough that rollback is instant if needed.

Bottom line

Standard blue-green + AI-specific timing. See graceful shutdown.
