Performance budgets are a discipline borrowed from web performance engineering and applied to AI. Set explicit numerical targets (TTFT < X ms, cost < Y per request); enforce them in CI; fail builds when changes blow the budget. This prevents the slow drift toward sluggish, expensive AI features.
For each AI feature, define budgets: p99 TTFT, p99 TPOT, p99 end-to-end, cost per request. Run budget checks in CI on every change. Fail the build if any budget is exceeded. Document the trade-offs when budgets are intentionally raised.
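One way to make these budgets concrete is to declare them in code next to the feature they govern. A minimal sketch, assuming hypothetical feature names and budget values (none of these numbers come from the text above):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Budget:
    """Per-feature performance budget; all latencies in milliseconds."""
    p99_ttft_ms: float       # p99 time to first token
    p99_tpot_ms: float       # p99 time per output token
    p99_e2e_ms: float        # p99 end-to-end request latency
    cost_per_request: float  # currency units per request

# Hypothetical budgets for two illustrative AI features.
BUDGETS = {
    "chat_assist":   Budget(p99_ttft_ms=500, p99_tpot_ms=50,
                            p99_e2e_ms=8000, cost_per_request=0.02),
    "doc_summarise": Budget(p99_ttft_ms=800, p99_tpot_ms=60,
                            p99_e2e_ms=15000, cost_per_request=0.05),
}
```

Keeping budgets in a single frozen structure makes them easy to version-review: any PR that touches a number is an explicit, visible budget change.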
Why budgets
- Without budgets: each individual change adds 50 ms; six changes = 300 ms; nobody notices
- With budgets: each change must justify any latency or cost increase
- Forces explicit conversations about trade-offs (better quality vs. latency cost)
- Preserves user experience over time, not just at launch
Metrics
Per-feature budgets:
- p99 TTFT: time to first token; user-perceived snappiness
- p99 TPOT: time per output token; perceived smoothness during streaming
- p99 end-to-end: full request latency including retrieval + LLM + response shaping
- Cost per request: tokens used × £/M; tracked across self-hosted + fallback API
- Quality eval score: from harness; can't drop below baseline without explicit approval
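The latency and cost metrics above can be computed from raw request logs. A minimal sketch using the nearest-rank p99 and the tokens x price-per-million cost formula; the sample values and per-million prices are illustrative assumptions:

```python
import math

def p99(samples):
    """Nearest-rank p99: the value at rank ceil(0.99 * n) in sorted order."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

def cost_per_request(prompt_tokens, output_tokens,
                     price_in_per_m, price_out_per_m):
    """Tokens used x price per million tokens, summed over input and output."""
    return (prompt_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# Example: 100 TTFT samples of 1..100 ms; nearest-rank p99 is 99 ms.
ttft_p99 = p99(list(range(1, 101)))

# Example: 1000 prompt tokens at 0.50/M in, 500 output tokens at 1.50/M out.
cost = cost_per_request(1000, 500, 0.50, 1.50)  # 0.00125 per request
```

The same helpers serve both CI (over load-test samples) and production monitors (over a sliding window of live requests), so the budget is checked against one definition of p99 everywhere.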
Enforcement
- CI gates: load test + eval harness + budget check on every PR
- Production monitors: alert when sustained p99 exceeds budget
- Quarterly review: revisit budgets; raise or lower based on user feedback and product priorities
- Documented exceptions: when budget is intentionally raised, document why + approve formally
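The CI gate itself can be a small script that compares measured p99s against the budget and fails the build on any violation. A sketch, assuming hypothetical measured values from a load-test run; in a real pipeline the non-zero exit code is what fails the build:

```python
# Hypothetical measured p99s from a CI load-test run.
MEASURED = {"p99_ttft_ms": 620.0, "p99_e2e_ms": 7400.0, "cost_per_request": 0.018}
BUDGET   = {"p99_ttft_ms": 500.0, "p99_e2e_ms": 8000.0, "cost_per_request": 0.02}

def check(measured, budget):
    """Return human-readable violations; an empty list means the build passes."""
    return [
        f"{name}: measured {measured[name]} exceeds budget {limit}"
        for name, limit in budget.items()
        if measured[name] > limit
    ]

violations = check(MEASURED, BUDGET)
for v in violations:
    print("BUDGET EXCEEDED:", v)
# In CI, wire this to the exit code: sys.exit(1 if violations else 0)
```

Printing every violation (rather than stopping at the first) gives the PR author the full picture in one CI run; a documented exception would then be expressed as an approved change to BUDGET, not a skipped check.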
Verdict
Performance budgets prevent slow drift toward sluggish, expensive AI features. The discipline is borrowed from web perf and applies cleanly to AI. Set budgets at launch; enforce in CI; review quarterly. Without them, your AI feature is on a slow path to unacceptable performance.
Bottom line
Set budgets; enforce in CI. See eval harness.