Updating the model behind a production AI feature is riskier than a typical software deploy: model output is generative and can regress unpredictably. The blue-green pattern, adapted for AI, is the safer path.
Run the new model alongside the old one (blue-green) on separate ports. Route a fraction of traffic to it via a feature flag. Monitor eval scores and user feedback for 7-14 days. Promote to 100% only when eval scores match or exceed the baseline. Always keep the old version warm for instant rollback.
Pattern
- Stand up the new model version on a separate vLLM process / port
- A LiteLLM router or feature flag splits traffic between old and new
- An eval harness runs continuously against both versions
- Promote based on eval-score and user-feedback gates
- Rollback: flip the flag; both versions stay warm during the rollout window
- Decommission the old version only after the monitoring period (7-14 days minimum)
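The traffic split can be sketched as deterministic hash bucketing behind a single config value. This is a minimal illustration, not LiteLLM's actual router API; the URLs, percentage, and function names are assumptions for the example.

```python
import hashlib

# Hypothetical rollout settings; adjust to your own deployment.
NEW_MODEL_TRAFFIC_PCT = 25
BLUE_URL = "http://localhost:8000/v1"   # old model's vLLM server
GREEN_URL = "http://localhost:8001/v1"  # new model's vLLM server

def route(user_id: str) -> str:
    """Pick the blue or green endpoint for a user.

    Hashing the user id keeps each user pinned to one model version for
    the whole rollout window, so their experience stays consistent.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return GREEN_URL if bucket < NEW_MODEL_TRAFFIC_PCT else BLUE_URL

# Rollback is a config change, not a deploy: set NEW_MODEL_TRAFFIC_PCT = 0
# and every request routes back to the warm blue instance.
```

Hash bucketing (rather than random choice per request) is what makes the flag-flip rollback clean: no user sees the two versions interleaved mid-session.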
Eval-driven gating
Three eval gates must pass before promoting a new model:
- Quality eval: representative prompts; new score ≥ baseline (or within 1-2%)
- Safety eval: harmful-output regression check; new model passes safety bar
- Cost / latency: new model within acceptable cost + latency envelope
If any gate fails, hold the rollout and investigate before retrying.
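The three gates can be expressed as one promotion check. The thresholds, field names, and 2% quality tolerance below are illustrative assumptions; tune them to your own baseline and budget.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    quality: float         # mean score on representative prompts
    safety_passed: bool    # harmful-output regression check result
    cost_per_1k: float     # dollars per 1k requests
    p95_latency_ms: float  # p95 response latency

# Illustrative envelopes; replace with your real budget.
QUALITY_TOLERANCE = 0.02   # new score may trail baseline by at most 2%
MAX_COST_PER_1K = 5.00
MAX_P95_LATENCY_MS = 1200

def gates_pass(new: EvalResult, baseline: EvalResult) -> list[str]:
    """Return the list of failed gates; an empty list means promote."""
    failures = []
    if new.quality < baseline.quality * (1 - QUALITY_TOLERANCE):
        failures.append("quality")
    if not new.safety_passed:
        failures.append("safety")
    if new.cost_per_1k > MAX_COST_PER_1K or new.p95_latency_ms > MAX_P95_LATENCY_MS:
        failures.append("cost/latency")
    return failures
```

Returning the failed gates, rather than a bare boolean, gives the rollout log something to record when the promotion is held.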
Rollout
Standard rollout cadence:
- Day 0-1: 5% traffic, internal users only
- Day 1-3: 25% traffic, including production users
- Day 3-7: 75% traffic if metrics hold
- Day 7-14: 100% traffic; old version stays warm
- Day 14+: decommission old version
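The cadence above reduces to a day-indexed lookup that the router config can read. The audience labels and schedule table are a sketch of the plan as written, assuming promotion gates held at each step.

```python
# (first_day, traffic_pct, audience) mirroring the cadence above.
# After day 14 the old version is decommissioned; traffic stays at 100%.
ROLLOUT_SCHEDULE = [
    (0, 5, "internal"),
    (1, 25, "production"),
    (3, 75, "production"),
    (7, 100, "production"),
]

def traffic_for_day(day: int) -> tuple[int, str]:
    """Return (traffic_pct, audience) for a given day since rollout start."""
    pct, audience = 0, "none"
    for start_day, p, a in ROLLOUT_SCHEDULE:
        if day >= start_day:
            pct, audience = p, a
    return pct, audience
```

Keeping the schedule as data means holding a rollout (after a failed gate) is just not advancing the day index, and rollback is dropping the percentage to zero.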
Verdict
For production AI, blue-green model rollout is the standard pattern. Eval-driven gating, a gradual traffic shift, and an always-warm rollback path catch regressions before users do. Skip the gradual rollout and you'll learn the lesson when the new model unexpectedly regresses on a workload your eval didn't cover.
Bottom line
Eval-gated blue-green is the safe pattern. See the deployment checklist.