A runbook is the difference between "3am panicked debug" and "follow the steps, get back to bed". The template is straightforward; the discipline is keeping it current as the system evolves.
One runbook per incident class, covering: symptoms (alerts, user reports), triage (which dashboards to check), diagnosis (likely causes, ordered by frequency), mitigation (fast fixes), recovery (verification), escalation (when and to whom), post-mortem (template). Keep runbooks in the repo as Markdown; review quarterly with the on-call rotation.
Structure
For each recurring incident class, one runbook with these sections:
- Symptoms: what alerts fire / what users report
- Triage: the first 30 seconds; which dashboards confirm the incident class
- Diagnosis: ordered list of likely causes; how to identify which
- Mitigation: fast-fix actions before deeper diagnosis (route traffic, restart, scale)
- Recovery verification: which metrics return to baseline
- Escalation: when on-call should escalate, to whom
- Post-mortem: what to capture; deadline
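A minimal skeleton for such a runbook, as a Markdown file in the repo, might look like the following. Every name in angle brackets is a placeholder to fill in per incident class, not a reference to a real setup:

```markdown
# Runbook: <incident class, e.g. "vLLM queue overflow / 503s">

## Symptoms
- Alert: `<alert name>` firing
- Users report: <user-facing manifestation>

## Triage (first 30 seconds)
- Dashboard: <link>
- Confirming signal: <metric or log line>

## Diagnosis (most frequent cause first)
1. <likely cause>: check <where>
2. <next likely cause>: check <where>

## Mitigation
1. <fast-fix action, exact command>
2. <expected outcome of each step>

## Recovery verification
- <metric> back below <threshold> for <duration>

## Escalation
- Escalate to <team/person> if <condition> persists after <time>

## Post-mortem
- Capture: timeline, root cause, gaps this runbook had
- Due: <deadline>
```

One file per incident class keeps diffs reviewable when the quarterly review updates a single section.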
Sections
Symptoms section (be concrete):
- Alert names that fire
- User-facing manifestations
- Distinguishing features vs other incident classes
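As an illustration, the symptoms block of a hypothetical queue-overflow runbook; the alert name is invented:

```markdown
## Symptoms
- Alert: `VLLMQueueDepthHigh` firing (placeholder alert name)
- Users report: requests failing with HTTP 503, or hanging before the first token
- Distinguish: queue depth high while GPU clocks are normal; thermal throttling
  shows the opposite (reduced clocks, normal queue depth)
```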
Triage section (the first 30 seconds):
- Which Grafana dashboard (link the exact URL)
- Which logs to check
- Which signal confirms vs rules out
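For example, a triage block for that same hypothetical vLLM 503 runbook might read as follows. The dashboard name and `kubectl` target are assumptions about the deployment; `vllm:num_requests_waiting` and `vllm:num_requests_running` are metrics vLLM exposes on its Prometheus endpoint, but verify the names against your version:

```markdown
## Triage (first 30 seconds)
- Open: Grafana > "LLM Serving" dashboard (placeholder name)
- Confirm: `vllm:num_requests_waiting` climbing while `vllm:num_requests_running`
  sits flat at its maximum
- Logs: `kubectl logs deploy/vllm --tail=50` for scheduler/queue messages
  (assumes a Kubernetes deployment named `vllm`)
- Rule out: queue empty and GPU utilization low means this is not queue
  overflow; switch to the thermal-throttling runbook
```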
Mitigation section (fast actions):
- Specific commands / button clicks
- Order of operations
- Expected outcome of each step
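Continuing the same hypothetical runbook, a mitigation block with the ordering and expected outcomes spelled out. The gateway and Kubernetes names are assumptions; `--max-num-seqs` is a real vLLM engine argument but takes effect only on restart, which is why the sketch sheds load at the edge first:

```markdown
## Mitigation
1. Shed load at the edge: lower the gateway rate limit or client concurrency
   (assumes a gateway in front of vLLM); `--max-num-seqs` is a restart-time
   knob, so do not restart for it mid-incident
2. Scale out: `kubectl scale deploy/vllm --replicas=4` (assumes a Kubernetes
   deployment named `vllm`); expect the 503 rate to fall within ~2 minutes of
   the new pod passing readiness
3. If scaling is not possible, fail over to the hosted-API fallback; expect
   higher cost-per-token until rollback
```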
Examples
Common AI runbooks:
- vLLM queue overflow / 503s
- GPU thermal throttling
- p99 TTFT spike
- Hosted-API fallback unreachable
- Eval score regression detected on shadow traffic
- Vector store query latency spike
- OOM on vLLM startup
- Cost-per-token regression
Verdict
Runbooks for the 8-12 common AI incident classes are essential. Each runbook takes roughly 30-60 minutes to write; the full set is a few days of focused work. Update a runbook each time an incident exposes a gap. The investment pays for itself the first time a 3am page resolves in 15 minutes instead of 90.
Bottom line
Runbook per incident class. See on-call rotation.