Production AI fails in specific recurring ways. The catalogue is finite; each mode has known detection patterns and mitigations. Operating from a checklist beats firefighting.
Seven failure classes: (1) hardware (GPU, network, disk), (2) capacity (queue overflow, OOM), (3) latency (TTFT spike, decoding stall), (4) quality (eval drift, hallucination spike), (5) safety (jailbreak success, harmful output), (6) integration (hosted-API failure, vector store down), (7) data (corruption, deletion). The sections below give scenarios, detection, and mitigation for each.
Failure classes
- Hardware: GPU thermal, ECC errors, PCIe link, NVMe failure
- Capacity: vLLM queue overflow, GPU OOM, KV cache exhaustion
- Latency: p99 TTFT spike, decoding stall, cold-start during deploy
- Quality: eval drift, hallucination on routine queries, format-validation failures
- Safety: jailbreak success, prompt injection, harmful output
- Integration: hosted-API fallback unreachable, vector store query failure, embedding service down
- Data: vector store corruption, log volume disk fill, training data leakage
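Keeping the catalogue as data lets tooling check that every class has a runbook. The following is a minimal sketch, not a prescribed layout: `FailureClass`, `CATALOGUE`, and `classes_without_runbook` are hypothetical names, and the scenarios are copied from the list above.

```python
from enum import Enum


class FailureClass(Enum):
    HARDWARE = "hardware"
    CAPACITY = "capacity"
    LATENCY = "latency"
    QUALITY = "quality"
    SAFETY = "safety"
    INTEGRATION = "integration"
    DATA = "data"


# Scenarios per class, taken verbatim from the catalogue above.
CATALOGUE = {
    FailureClass.HARDWARE: ["GPU thermal", "ECC errors", "PCIe link", "NVMe failure"],
    FailureClass.CAPACITY: ["vLLM queue overflow", "GPU OOM", "KV cache exhaustion"],
    FailureClass.LATENCY: ["p99 TTFT spike", "decoding stall", "cold-start during deploy"],
    FailureClass.QUALITY: ["eval drift", "hallucination on routine queries",
                           "format-validation failures"],
    FailureClass.SAFETY: ["jailbreak success", "prompt injection", "harmful output"],
    FailureClass.INTEGRATION: ["hosted-API fallback unreachable",
                               "vector store query failure", "embedding service down"],
    FailureClass.DATA: ["vector store corruption", "log volume disk fill",
                        "training data leakage"],
}


def classes_without_runbook(runbooks):
    """Return the failure classes that have no runbook entry.

    `runbooks` is assumed to map FailureClass -> path or URL of a runbook.
    """
    return [fc for fc in FailureClass if not runbooks.get(fc)]
```

Calling `classes_without_runbook({FailureClass.HARDWARE: "hardware.md"})` would return the six classes that still need a runbook, which is a cheap check to run before a quarterly review.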
Detection
| Class | Detection |
|---|---|
| Hardware | DCGM exporter alerts |
| Capacity | vLLM queue depth + p99 latency alerts |
| Latency | Prometheus alerts on histogram percentiles |
| Quality | Eval harness on shadow traffic + user feedback |
| Safety | Output classifier + manual sampling |
| Integration | Health check + dependency monitoring |
| Data | Backup verification + corruption detection |
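To make the latency row concrete, here is a rough sketch of how a percentile can be estimated from cumulative histogram buckets using linear interpolation inside the matching bucket, which is the same general approach as Prometheus's `histogram_quantile`. The function names, the bucket layout, and the 1.5 s threshold are assumptions for illustration; a real deployment would normally express this as an alerting rule rather than application code.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs,
    sorted by upper bound, with a final bucket whose bound is math.inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("last bucket must have upper bound math.inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if upper == math.inf:
                # Cannot interpolate into the overflow bucket;
                # report the highest finite bound instead.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


def p99_breaches(buckets, threshold_s=1.5):
    """True if the estimated p99 exceeds the (assumed) threshold in seconds."""
    return estimate_quantile(0.99, buckets) > threshold_s
```

The estimate is only as fine as the bucket boundaries, so bucket bounds should bracket the latency target closely enough for the alert to be meaningful.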
Mitigation
- Hardware: reduce power cap, replace card; failover to standby
- Capacity: shed load to fallback; scale replicas; tighten rate limits temporarily
- Latency: route to fallback; investigate; capacity-add if sustained
- Quality: roll back to the previous model / prompt version
- Safety: add output filter; tighten input sanitisation; roll back if the model is the cause
- Integration: failover; degrade gracefully; alert dependency owner
- Data: restore from backup; investigate corruption source
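The capacity mitigation ("shed load to fallback") can be sketched as a small routing decision. This is an illustrative sketch only: `choose_backend`, the backend labels, and the default queue-depth limit of 64 are hypothetical, and the queue depth is assumed to come from whatever metric the serving layer already exposes.

```python
def choose_backend(queue_depth, fallback_healthy, max_queue_depth=64):
    """Decide where to send a request under capacity pressure.

    Returns "primary" while the queue has headroom, "fallback" when the
    primary is saturated and the fallback is healthy, and "reject" when
    neither can take the request (the caller should return a retryable error).
    """
    if queue_depth < max_queue_depth:
        return "primary"
    if fallback_healthy:
        return "fallback"
    return "reject"
```

Keeping the decision a pure function makes the shedding policy easy to unit-test and to change during an incident without touching the request path.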
Verdict
Knowing the failure-mode catalogue ahead of time turns a 3am incident into a "follow the runbook" routine. Each class has ~3-5 specific scenarios; mitigation is documented; recovery is bounded. Build a runbook per class and review it quarterly with the on-call rotation.
Bottom line
Catalogue failure modes; runbook each. See incident runbook.