For AI features, A/B experiments need careful design. The standard SaaS metrics (conversion, retention) still apply, but AI-specific signals (eval score, response quality, hallucination rate) need explicit measurement. Statistical discipline matters because output non-determinism muddies the signal.
Metric set: business outcome (conversion, retention) + AI quality (eval score, user feedback) + cost (per request) + latency (p99 TTFT). Random assignment at the user / session level, not the request level. Statistical significance: more conservative than typical web A/B because output variance is higher. Plan for 2-4 week experiments; smaller effects need longer runs.
Metrics
- Primary business metric: conversion, retention, NPS, task success
- AI quality: eval harness score on production traffic; user feedback (thumbs / rating)
- Cost per request: tokens + caching + fallback
- Latency: p50 / p99 TTFT, total request time
- Hallucination rate: structured-output validation failures, factual-claim accuracy on sample
- Engagement: re-query rate (a high rate usually means retrieval is failing), session length (a per-request logging sketch for all of these follows this list)
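A minimal sketch of one per-request record covering these metrics, assuming each model call is logged individually; the `ExperimentEvent` name and every field are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class ExperimentEvent:
    """One row per model request, tagged with the experiment variant.
    All field names are illustrative, not a fixed schema."""
    user_id: str
    session_id: str
    variant: str               # "control" or "treatment"
    converted: bool            # primary business metric, defined per feature
    feedback: int | None       # thumbs: +1 / -1, None if not given
    eval_score: float | None   # eval-harness score, if this request was sampled
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float
    ttft_ms: float             # time to first token
    total_ms: float
    cache_hit: bool
    schema_valid: bool         # structured-output validation passed
    ts: float = 0.0

def log_event(event: ExperimentEvent) -> None:
    # In practice this goes to your analytics pipeline; stdout for the sketch.
    event.ts = event.ts or time.time()
    print(json.dumps(asdict(event)))
```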
Design
- Random assignment at user / session level, not per-request: within a session, behaviour should be consistent (see the hashing sketch after this list)
- Stratify by tenant tier / region / use case: avoid imbalanced segments
- Power analysis upfront: how big a sample do you need to detect the effect size you care about? (worked example after this list)
- Run for full business cycle: 2-4 weeks minimum to capture weekly patterns
- Pre-register hypotheses: prevent post-hoc fishing for significant differences
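For the assignment rule, hashing a stable unit id together with the experiment name gives deterministic user-level buckets, so the same user always sees the same variant. A minimal sketch; the experiment name is a made-up example:

```python
import hashlib

def assign_variant(unit_id: str, experiment: str,
                   variants: tuple[str, ...] = ("control", "treatment")) -> str:
    """Deterministic assignment at the user / session level.

    Hashing (experiment, unit_id) keeps a unit in the same variant for the
    whole experiment, so behaviour within a session stays consistent.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# Same user always lands in the same bucket:
assert assign_variant("user-42", "ai-summary-v2") == assign_variant("user-42", "ai-summary-v2")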
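For the upfront power analysis, a worked example using statsmodels; the baseline rate, target lift, alpha, and power are illustrative numbers, not recommendations:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Illustrative: baseline 12% conversion, hoping to detect a lift to 13%,
# with a stricter alpha than typical web A/B (0.01 instead of 0.05).
effect = proportion_effectsize(0.12, 0.13)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.01, power=0.8, alternative="two-sided"
)
print(f"~{n_per_variant:,.0f} users per variant")
```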
Pitfalls
- Output non-determinism inflates variance: same prompt → different outputs → different user reactions. Compensate with larger samples.
- Caching skews results: a variant with a better cache hit rate looks artificially faster / cheaper. Measure post-cache impact (see the split-by-cache-hit sketch after this list).
- User adaptation: users learn to interact with each variant differently; new behaviour confounds metrics.
- Hosted-API rate limits: a variant routed to a frontier API may degrade unexpectedly under load; monitor it.
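One way to expose caching skew is to compare variants separately for cache hits and misses. A sketch assuming the per-request records from the metrics section, stored in a hypothetical `experiment_events.parquet`:

```python
import pandas as pd

# One row per request; variant, cache_hit, ttft_ms, cost_usd are the assumed
# field names from the ExperimentEvent sketch above.
events = pd.read_parquet("experiment_events.parquet")

# Comparing variants without conditioning on cache_hit flatters whichever
# variant happens to hit the cache more often; split the comparison instead.
summary = (
    events
    .groupby(["variant", "cache_hit"])
    .agg(requests=("ttft_ms", "size"),
         p99_ttft_ms=("ttft_ms", lambda s: s.quantile(0.99)),
         mean_cost_usd=("cost_usd", "mean"))
)
print(summary)
```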
Verdict
A/B experiments on AI features need standard rigour plus AI-specific extensions: more conservative significance thresholds, longer runs, pre-registered hypotheses. The discipline pays off in confident decisions; sloppy A/B testing produces noise misinterpreted as signal.
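As an illustration of the stricter threshold, a minimal per-user significance check on synthetic data; alpha = 0.01 and the conversion rates are assumptions for the example, not targets:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Illustrative per-user conversion outcomes (0/1); in practice these come from
# the experiment events, aggregated to one value per user.
control = rng.binomial(1, 0.12, size=20_000)
treatment = rng.binomial(1, 0.13, size=20_000)

ALPHA = 0.01  # stricter than the usual 0.05, per the verdict above
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"p={p_value:.4f}, significant={p_value < ALPHA}")
```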
Bottom line
Standard A/B + AI-specific metrics. See canary rollback.