Blue-green deployment for AI services follows the standard pattern with three twists: long cold-start times mean you can't skip the warm-up, in-flight requests need draining, and stateful caches (prefix cache, semantic cache) may need invalidation depending on what changed.
Run two parallel deployments (blue / green); load balancer points at one. Deploy new version to idle pool, warm up, run smoke tests + eval harness, flip load balancer, drain old. Cold-start ~30-90s per replica; budget for it. Prefix cache resets per process; semantic cache survives if separate.
Pattern
- Two replica pools: blue (current production) + green (idle / target)
- Deploy new version to green pool
- Wait for green to fully start (vLLM cold-start ~30-90s)
- Run smoke test + eval harness against green
- Send 1% production traffic to green; monitor for 5-10 minutes
- Ramp to 100%: 1% → 25% → 75% → 100% over 30-60 minutes
- Drain in-flight requests on blue (graceful shutdown ~60 seconds)
- Decommission blue pool (or keep warm for instant rollback)
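The ramp steps above can be sketched as a small loop. This is a minimal sketch under stated assumptions, not a definitive implementation: `set_green_weight` and `is_healthy` are hypothetical callables standing in for whatever your load balancer and monitoring actually expose, and the 10-minute soak per step is an assumed value chosen to fit the 30-60 minute ramp.

```python
import time
from typing import Callable, Sequence


def ramp_traffic(
    set_green_weight: Callable[[int], None],
    is_healthy: Callable[[], bool],
    steps: Sequence[int] = (1, 25, 75, 100),
    soak_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Shift traffic to green step by step; send everything back to blue on failure.

    set_green_weight and is_healthy are hypothetical hooks (assumptions),
    not part of any specific load balancer API.
    """
    for pct in steps:
        set_green_weight(pct)      # e.g. 1% of production traffic to green
        sleep(soak_seconds)        # let metrics accumulate at this step
        if not is_healthy():
            set_green_weight(0)    # abort: all traffic back to blue
            return False
    return True
```

Injecting `sleep` keeps the loop easy to dry-run: pass `sleep=lambda s: None` and a list's `append` as `set_green_weight` to see the sequence of weights without waiting.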
AI-specific gotchas
- Cold start: ~30-90s per replica for vLLM start + first-request warm-up. Don't flip until ready.
- Prefix cache reset: each new process starts cold. First-N requests after flip may be slower until cache warms.
- Semantic cache survival: if running in separate Redis / Qdrant process, survives. If in-process, resets.
- Long-running streaming requests: may take 60+ seconds to drain. Set TimeoutStopSec=120 or longer.
- Model version coordination: prompt templates may need to change with the model version, so deploy them as a unit.
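The "don't flip until ready" rule from the cold-start item can be enforced with a readiness gate. The sketch below is illustrative only: `probe` is a hypothetical callable that you would implement yourself (for example, a health check followed by one short warm-up request against a green replica), and requiring three consecutive successes is an assumption, not something the pattern mandates.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 180.0,
    interval_s: float = 2.0,
    required_successes: int = 3,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once probe() succeeds required_successes times in a row.

    probe is a hypothetical hook (assumption); exceptions count as failures.
    Returns False if the timeout elapses first.
    """
    deadline = clock() + timeout_s
    streak = 0
    while clock() < deadline:
        try:
            ok = probe()
        except Exception:
            ok = False             # connection refused while the server boots
        streak = streak + 1 if ok else 0
        if streak >= required_successes:
            return True
        sleep(interval_s)
    return False
```

The default timeout of 180 seconds is twice the upper end of the ~30-90s cold-start estimate; adjust it to what you observe for your model and hardware.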
Rollback
Blue-green's killer feature is instant rollback: flip the load balancer back while the blue pool is still warm. Full rollback takes ~30 seconds if blue has not yet been decommissioned. After decommission, rollback follows the standard re-deploy path (~5 minutes).
Best practice: keep the blue pool warm for 24-48 hours after green takes 100% traffic. The cost is a duplicate pool's worth of GPU capacity for up to two days; the value is instant rollback insurance.
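As a rough illustration of how the keep-warm window determines the rollback path, here is a minimal sketch. The function name `rollback_plan` and its return shape are hypothetical; the 30-second and 5-minute figures are the approximate estimates quoted above, not measurements.

```python
from datetime import datetime, timedelta
from typing import Tuple


def rollback_plan(
    cutover_at: datetime,
    now: datetime,
    keep_warm: timedelta = timedelta(hours=48),
) -> Tuple[str, int]:
    """Pick the rollback path and its approximate duration in seconds.

    Assumes blue is decommissioned exactly when the keep-warm window ends.
    """
    blue_still_warm = now - cutover_at < keep_warm
    if blue_still_warm:
        return "flip load balancer back to blue", 30
    return "standard re-deploy", 300


cutover = datetime(2024, 1, 1, 12, 0)
print(rollback_plan(cutover, cutover + timedelta(hours=3)))
print(rollback_plan(cutover, cutover + timedelta(hours=49)))
```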
Verdict
For production AI, blue-green is the right deploy pattern. The cold-start cost is real but manageable. Always run the eval harness on the green pool before flipping. Keep the blue pool warm long enough that rollback is instant if needed.
Bottom line
Standard blue-green + AI-specific timing. See graceful shutdown.