Feature flags are essential for AI feature rollout — model output is generative and non-deterministic; bugs surface in subtle ways. The standard SaaS feature-flag patterns apply, plus AI-specific extensions: per-prompt-version flags, per-model-version flags, and confidence-based fallback routing.
Use a feature flag system (LaunchDarkly, GrowthBook, ConfigCat, or open-source equivalent). Flag granularly: prompt version, model version, RAG strategy, fallback routing. Roll out gradually: 5% → 25% → 75% → 100%. Monitor eval scores + user feedback as ramp gates. Always have rollback path via flag flip.
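The layered flag setup above can be sketched as a plain config. This is a minimal sketch, assuming a dict-backed flag store and hash-based user bucketing; the flag names and percentages are illustrative, not tied to any particular provider:

```python
# Illustrative flag configuration for an AI feature rollout.
# Names and values are hypothetical; any flag provider
# (LaunchDarkly, GrowthBook, ConfigCat, ...) can express the same shape.
FLAGS = {
    "ai_feature_enabled": True,               # kill switch: flip to False for emergency disable
    "prompt_version": {"v4": 5, "v3": 95},    # percentage split for the new prompt
    "model_version": {"llama_3_3": 5, "llama_3_1": 95},
    "hybrid_search_on": False,                # RAG strategy flag
    "fallback_threshold": 0.7,                # confidence below this escalates to a hosted API
}

def prompt_for(user_bucket: int) -> str:
    """Pick a prompt version for a user. user_bucket is 0-99,
    e.g. a stable hash(user_id) % 100, so assignment is sticky."""
    split = FLAGS["prompt_version"]
    return "v4" if user_bucket < split["v4"] else "v3"
```

Sticky bucketing matters here: a user who flips between prompt versions mid-session makes quality feedback impossible to attribute.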
Why flag
- Generative outputs are non-deterministic — bugs may not surface in unit tests
- Quality regressions are subtle — slight prompt changes can degrade output meaningfully
- Different cohorts may need different behaviour — beta users get new features first
- Fast rollback is critical — feature flag flip beats redeploy by hours
Patterns
- Prompt version flag: prompt_v3 vs prompt_v4 at a 5%/95% split
- Model version flag: model_llama_3_1 vs model_llama_3_3
- RAG strategy flag: hybrid_search_on vs off
- Fallback routing flag: fallback_threshold = 0.7 for confidence-based escalation to a frontier API
- Feature kill switch: ai_feature_enabled = true/false for emergency disable
Metrics
Watch these during AI feature rollouts:
- Eval harness score: per-version score; should hold or improve
- p99 TTFT / TPOT: latency regression detection
- User "helpful" rate: thumbs-up rate or equivalent
- Hosted-API fallback rate: how often new version triggers fallback
- Cost per user: economic regression detection
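The metrics above act as gates between ramp stages. A sketch of that gate, assuming you record each metric alongside its pre-rollout baseline; the specific tolerances (10% latency, 5% helpful-rate, 20% cost) are illustrative assumptions, not recommendations:

```python
# Advance 5% -> 25% -> 75% -> 100% only while every metric holds;
# any gate failure rolls the flag back to 0%.
RAMP_STAGES = [5, 25, 75, 100]

def gate_ok(m: dict) -> bool:
    return (
        m["eval_score"] >= m["baseline_eval_score"]              # hold or improve
        and m["p99_ttft_ms"] <= 1.1 * m["baseline_p99_ttft_ms"]  # <10% latency regression
        and m["helpful_rate"] >= 0.95 * m["baseline_helpful_rate"]
        and m["cost_per_user"] <= 1.2 * m["baseline_cost_per_user"]
    )

def next_stage(current: int, m: dict) -> int:
    """One ramp step if gates pass; otherwise 0 (rollback via flag flip)."""
    if not gate_ok(m):
        return 0
    i = RAMP_STAGES.index(current)
    return RAMP_STAGES[min(i + 1, len(RAMP_STAGES) - 1)]
```

Keeping rollback as "return to 0%" rather than "hold at current %" is deliberate: a failing gate means users are already seeing a regression.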
Verdict
For AI features, feature flags are essential — not optional. The combination of non-deterministic output, subtle regressions, and the need for fast rollback makes flag systems the right primitive. Use them at every layer: prompt, model, retrieval, routing.
Bottom line
Flag every AI rollout. See prompt versioning.