Canary deployment is only valuable if rollback works reliably and quickly. The rollback path needs to be tested, fast (< 1 minute), and complete (no lingering canary state). Three mechanics matter: feature-flag-driven traffic redirection, a warm previous version, and in-flight request handling.
Roll back in < 1 minute by flipping a feature flag. Keep the previous version warm for the whole canary window (typically 24-48 hours). Let in-flight requests on the canary complete and route new requests to the previous version. Verify the rollback by confirming that error rates fall and eval scores recover to baseline. Document the rollback decision: what triggered it, what was learnt, and what would change.
Triggers
Rollback triggers (the first three fire automatically, the last is manual):
- Error rate > 2× baseline for 2 minutes
- p99 TTFT > 2× SLO for 5 minutes
- Eval score on canary traffic drops > 5%
- Manual: on-call engineer sees a regression in user feedback
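The three automatic thresholds can be expressed as a small decision function. The sketch below is illustrative only: the `Sample` type, the function names, and the reading of "drops > 5%" as a relative drop against the baseline score are assumptions, not part of any particular monitoring stack. In practice the "for N minutes" condition would usually live in the alerting system itself.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Sample:
    t: float      # timestamp in seconds
    value: float  # metric value at that time


def sustained_breach(samples: Sequence[Sample], threshold: float,
                     duration_s: float, now: float) -> bool:
    """True if the metric stayed above threshold for the whole window."""
    window_start = now - duration_s
    window = [s for s in samples if window_start <= s.t <= now]
    # Require a sample at or before the window start, so a metric that only
    # just started reporting cannot trigger a rollback on its own.
    covers_window = any(s.t <= window_start for s in samples)
    return covers_window and bool(window) and all(s.value > threshold for s in window)


def rollback_reasons(error_rate: Sequence[Sample], baseline_error_rate: float,
                     p99_ttft: Sequence[Sample], ttft_slo_s: float,
                     canary_eval: float, baseline_eval: float,
                     now: float) -> List[str]:
    """Return the names of the automatic triggers that currently fire."""
    reasons = []
    if sustained_breach(error_rate, 2 * baseline_error_rate, 120, now):
        reasons.append("error_rate")   # > 2x baseline for 2 minutes
    if sustained_breach(p99_ttft, 2 * ttft_slo_s, 300, now):
        reasons.append("p99_ttft")     # > 2x SLO for 5 minutes
    # Assumption: "drops > 5%" means a relative drop against the baseline score.
    if baseline_eval > 0 and (baseline_eval - canary_eval) / baseline_eval > 0.05:
        reasons.append("eval_score")
    return reasons
```

An empty list means no automatic trigger fired; any non-empty result is grounds to flip the flag.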
Wire the automatic triggers as Prometheus alert → Alertmanager → webhook to the feature flag service. For the manual trigger, give engineers a one-click rollback runbook.
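The receiving end of that webhook might look like the following sketch. It relies on the Alertmanager webhook payload carrying an `alerts` list whose entries have a `status` and `labels`. The `InMemoryFlagStore` class, the `canary_enabled` flag name, and the custom `action: rollback` alert label are hypothetical stand-ins for whatever the real flag service and alert rules use.

```python
from typing import Any, Dict


class InMemoryFlagStore:
    """Hypothetical stand-in for a real feature flag service client."""

    def __init__(self) -> None:
        self._flags: Dict[str, bool] = {"canary_enabled": True}

    def get(self, name: str) -> bool:
        return self._flags.get(name, False)

    def set(self, name: str, value: bool) -> None:
        self._flags[name] = value


def handle_alertmanager_webhook(payload: Dict[str, Any],
                                flags: InMemoryFlagStore) -> bool:
    """Disable the canary if any firing alert asks for rollback.

    Returns True when the payload requested a rollback.
    """
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        # Assumption: rollback-worthy alert rules carry `action: rollback`.
        if alert.get("status") == "firing" and labels.get("action") == "rollback":
            flags.set("canary_enabled", False)  # idempotent: safe to repeat
            return True
    return False
```

Resolved alerts are ignored on purpose: re-enabling the canary after an automatic rollback should be a deliberate human decision.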
Speed
Target: < 1 minute from rollback decision to traffic restored to previous version.
- Feature flag flip: instantaneous; the LiteLLM router picks up the change immediately
- DNS-based traffic shift: 60s+ depending on TTL; not the fast path
- Load balancer reconfiguration: 30-60s; viable as a fallback
- Service restart: too slow for AI rollback; the previous version must already be running
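The feature flag path is fast because the routing decision is made per request. The sketch below is a generic illustration, not LiteLLM's actual router: `CanaryRouter`, the `canary_enabled` flag, and the 5% default weight are assumptions. It also shows in-flight handling: a request keeps the backend it started on, and the canary counts as drained once the flag is off and its in-flight count reaches zero.

```python
import random
from typing import Dict, Optional


class CanaryRouter:
    """Hypothetical per-request router driven by a shared flag mapping."""

    def __init__(self, flags: Dict[str, bool], canary_weight: float = 0.05,
                 rng: Optional[random.Random] = None) -> None:
        self.flags = flags
        self.canary_weight = canary_weight
        self.rng = rng or random.Random()
        self.in_flight = {"canary": 0, "previous": 0}

    def start_request(self) -> str:
        # The flag is read on every request, so a flip applies to the next one.
        use_canary = (self.flags.get("canary_enabled", False)
                      and self.rng.random() < self.canary_weight)
        backend = "canary" if use_canary else "previous"
        self.in_flight[backend] += 1
        return backend

    def finish_request(self, backend: str) -> None:
        # In-flight requests complete on the backend they started on.
        self.in_flight[backend] -= 1

    def canary_drained(self) -> bool:
        """True once the flag is off and no canary request is still running."""
        return (not self.flags.get("canary_enabled", False)
                and self.in_flight["canary"] == 0)
```

Only after `canary_drained()` returns true is it safe to tear down the canary without cutting off users mid-response.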
Verification
Post-rollback verification:
- Error rate returns to baseline within 2-3 minutes
- p99 TTFT returns to baseline
- Eval scores on representative prompts return to baseline
- User feedback dashboard shows recovery
- Document the incident: what triggered it, what the actual cause was, and what would change for the next attempt
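The metric checks in this list can be automated along these lines. This is a minimal sketch: the `Metrics` and `RollbackRecord` types and the 10% tolerance band around baseline are assumptions chosen for illustration. The user feedback dashboard still needs a human to look at it.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Metrics:
    error_rate: float
    p99_ttft_s: float
    eval_score: float


def verify_rollback(current: Metrics, baseline: Metrics,
                    tolerance: float = 0.10) -> Dict[str, bool]:
    """Check each metric against baseline within an assumed tolerance band."""
    return {
        "error_rate": current.error_rate <= baseline.error_rate * (1 + tolerance),
        "p99_ttft": current.p99_ttft_s <= baseline.p99_ttft_s * (1 + tolerance),
        "eval_score": current.eval_score >= baseline.eval_score * (1 - tolerance),
    }


@dataclass(frozen=True)
class RollbackRecord:
    """Fields for the incident write-up, mirroring the last checklist item."""
    trigger: str
    actual_cause: str
    change_for_next_attempt: str
```

A rollback is verified only when every value in the returned mapping is true. Any false entry suggests the problem was not confined to the canary.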
Verdict
Rollback is the safety valve that makes canary deployment safe. Test it with a drill at least quarterly. Sub-1-minute rollback via feature flag is the standard. Slower rollback paths (DNS, restart) are acceptable as a fallback but shouldn't be the primary mechanism.
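A drill can be as simple as flipping the flag and timing how long it takes until a probe request is served by the previous version. The sketch below assumes two hypothetical callables supplied by the caller: `flip`, which triggers the rollback, and `probe`, which returns the name of the backend that served a test request. The clock and sleep functions are injectable so the drill logic itself can be tested without waiting.

```python
import time
from typing import Callable


def run_rollback_drill(flip: Callable[[], None],
                       probe: Callable[[], str],
                       budget_s: float = 60.0,
                       poll_s: float = 0.5,
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> float:
    """Flip the flag, then poll until traffic reaches the previous version.

    Returns the elapsed seconds, or raises TimeoutError if the budget
    (1 minute by default, matching the target) is exceeded.
    """
    start = clock()
    flip()
    while True:
        if probe() == "previous":
            return clock() - start
        if clock() - start >= budget_s:
            raise TimeoutError(f"rollback not observed within {budget_s}s")
        sleep(poll_s)
```

Recording the returned duration for each drill gives a trend line, which makes a slowly degrading rollback path visible before it matters.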
Bottom line
Sub-1-minute rollback via feature flag. See canary pattern.