Production AI on-call follows specific patterns. The alertable events differ from those of typical web services: generative output-quality regressions, model drift, and hosted-API fallback failures join the standard latency, error-rate, and capacity alerts.
On-call alerts that matter: GPU thermal / hardware faults, vLLM queue-depth spikes, p99 TTFT (time to first token) above SLO, hosted-API fallback failure, eval score drop beyond threshold, and structured-output validation failure rate. Rotate weekly with a primary and a secondary. Write a runbook for each alert: triage steps, mitigation, when to escalate, and how to verify recovery.
What to alert on
Alert-worthy (page someone):
- GPU temp > 90°C sustained — hardware issue
- p99 TTFT > 2× SLO for 5+ minutes — capacity or model issue
- vLLM queue depth > 100 — capacity exhaustion
- Error rate > 5% — service health
- Hosted-API fallback unreachable — graceful degradation broken
- Eval score drop > 5% on shadow traffic — quality regression
Watch-worthy (dashboard, not page); a threshold sketch covering both tiers follows this list:
- GPU temp 82-90°C
- Cache hit rate dropping
- Cost per token rising
- User feedback "not helpful" rate increasing
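To make the page/watch split concrete, here is a minimal sketch that encodes both tiers as data, so thresholds live in one reviewable place. The Rule and Severity types, metric names, and sustain windows are assumptions for illustration, not any particular monitoring stack's API; a real deployment would express these as alerting rules in its metrics system.

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    PAGE = "page"    # wake someone up
    WATCH = "watch"  # dashboard and ticket, no page

@dataclass
class Rule:
    metric: str
    threshold: float
    sustain_minutes: int  # breach must hold this long before firing
    severity: Severity

# Hypothetical metric names; thresholds mirror the lists above.
RULES = [
    Rule("gpu_temp_celsius", 90.0, 5, Severity.PAGE),
    Rule("p99_ttft_over_slo_ratio", 2.0, 5, Severity.PAGE),
    Rule("vllm_queue_depth", 100.0, 1, Severity.PAGE),
    Rule("error_rate", 0.05, 5, Severity.PAGE),
    Rule("fallback_unreachable", 1.0, 1, Severity.PAGE),
    Rule("shadow_eval_score_drop", 0.05, 30, Severity.PAGE),
    Rule("gpu_temp_celsius", 82.0, 15, Severity.WATCH),
    Rule("cache_hit_rate_drop", 0.10, 30, Severity.WATCH),
    Rule("cost_per_token_increase", 0.20, 60, Severity.WATCH),
    Rule("not_helpful_feedback_rate", 0.10, 60, Severity.WATCH),
]

def evaluate(metric: str, value: float) -> Severity | None:
    """Return the most severe tier this reading breaches, or None.

    Sustain windows would be enforced by the metrics pipeline; this
    sketch only checks the instantaneous value.
    """
    breached = [r for r in RULES if r.metric == metric and value >= r.threshold]
    if not breached:
        return None
    return min(breached, key=lambda r: list(Severity).index(r.severity)).severity
```

For example, `evaluate("gpu_temp_celsius", 86.0)` returns `Severity.WATCH`, while the same metric at 95 returns `Severity.PAGE`.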
Rotation
- Weekly rotation with primary + secondary on-call
- Hand off on Mondays with a state-of-the-system briefing
- Maximum one primary week in four, for sustainability (a schedule sketch follows this list)
- Compensate appropriately (financial or time-off-in-lieu)
- Junior engineers shadow a rotation before taking primary on-call
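As a sketch only: the one-in-four rule falls out of a simple round-robin once the pool has at least four engineers. The function and names below are hypothetical, not a scheduling tool's API.

```python
def rotation(engineers: list[str], weeks: int) -> list[tuple[str, str]]:
    """Weekly (primary, secondary) pairs via round-robin.

    With four or more engineers, each person is primary at most one
    week in four. Next week's primary serves as this week's secondary,
    so the Monday hand-off briefing goes to someone already warmed up.
    """
    if len(engineers) < 4:
        raise ValueError("need at least 4 engineers to honor the 1-in-4 limit")
    n = len(engineers)
    return [(engineers[w % n], engineers[(w + 1) % n]) for w in range(weeks)]

# Example: six weeks for a four-person team.
for week, (primary, secondary) in enumerate(rotation(["ana", "ben", "chi", "dev"], 6), 1):
    print(f"week {week}: primary={primary} secondary={secondary}")
```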
Runbooks
Each alert needs a runbook with the following (a minimal template sketch follows the list):
- Triage steps (which dashboards, which logs)
- Mitigation actions (route traffic, restart service, scale)
- When to escalate (timing, who to call)
- How to verify recovery (which metrics to watch)
- Post-incident actions
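One way to keep those five pieces from going missing is to store each runbook as structured data next to its alert definition. The schema and the example entry below are illustrative sketches, not a real alert's actual procedure.

```python
from dataclasses import dataclass, field

@dataclass
class Runbook:
    alert: str
    triage: list[str]            # dashboards and logs to check first
    mitigation: list[str]        # actions: reroute traffic, restart, scale
    escalate_after_minutes: int  # how long to try before calling for help
    escalate_to: str
    recovery_checks: list[str]   # metrics that must return to normal
    post_incident: list[str] = field(default_factory=list)

# Illustrative entry for the queue-depth page; every step is a placeholder.
VLLM_QUEUE_DEPTH = Runbook(
    alert="vllm_queue_depth > 100",
    triage=[
        "serving dashboard: queue depth, batch size, KV-cache usage",
        "vLLM logs for preemptions or OOM",
    ],
    mitigation=[
        "shift a share of traffic to the hosted-API fallback",
        "scale out inference replicas",
    ],
    escalate_after_minutes=15,
    escalate_to="inference-platform lead",
    recovery_checks=["queue depth < 20 for 10 min", "p99 TTFT back under SLO"],
    post_incident=["file the incident review", "update the capacity plan"],
)
```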
Verdict
On-call for production AI follows standard SRE patterns plus AI-specific extensions (eval drift, generative-quality regressions). Build the runbooks before you need them; rotate fairly; learn from every page. The first incident without a runbook costs more than writing twenty runbooks.
Bottom line
Standard SRE practices + AI-specific alerts. See incident response.