Table of Contents
AI incidents have predictable shapes. A standard postmortem template makes recurring causes obvious.
Postmortem template: 1) Timeline, 2) Impact, 3) Root cause (use one of 8 categories), 4) Detection (how long to detect?), 5) Mitigation (what stopped the bleed?), 6) Action items (concrete, owned, dated). Track the categories over time.
Template
- Timeline (when started, when detected, when mitigated, when resolved)
- Impact (which users, how many, what symptom)
- Root cause (one of 8 categories below)
- Detection (alarm fired, customer complained, etc.)
- Mitigation (fallback, restart, rollback)
- Action items
Root cause categories
- OOM under load
- Driver crash
- Model commit / config change
- Disk full
- Thermal throttling
- Network / proxy
- Auth / rate limit
- External dependency
Verdict
Track root cause categories. Repeat causes signal architectural debt.
Bottom line
Postmortem ritual matters. See incident runbook.