AI On-Call Runbook Template

A template runbook for AI on-call: the structure, the sections, and what to include for each incident class.

A runbook is the difference between "3am panicked debug" and "follow the steps, get back to bed". The template is straightforward; the discipline is keeping it current as the system evolves.

TL;DR

Write one runbook per incident class, covering: symptoms (alerts, user reports), triage (which dashboards), diagnosis (likely causes, ordered by frequency), mitigation (fast fixes), recovery (verification), escalation (when and who), post-mortem (template). Keep the runbooks in the repo as Markdown and review them quarterly with the on-call rotation.

Structure

For each recurring incident class, one runbook with these sections:

  • Symptoms: what alerts fire / what users report
  • Triage: the first 30 seconds; which dashboards confirm the incident class
  • Diagnosis: ordered list of likely causes, most frequent first, and how to tell which one applies
  • Mitigation: fast-fix actions before deeper diagnosis (route traffic, restart, scale)
  • Recovery verification: which metrics return to baseline
  • Escalation: when on-call should escalate, to whom
  • Post-mortem: what to capture; deadline
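
To make the structure concrete, here is a minimal sketch that generates an empty Markdown runbook with the seven sections listed above. The function name `render_skeleton`, the heading levels, and the prompt text in each comment are assumptions for illustration, not a fixed format.

```python
# Section names follow the list above; the prompts are illustrative placeholders.
SECTIONS = [
    ("Symptoms", "Which alerts fire? What do users report?"),
    ("Triage", "First 30 seconds: which dashboards confirm this class?"),
    ("Diagnosis", "Likely causes in order, and how to tell them apart."),
    ("Mitigation", "Fast fixes before deeper diagnosis (route traffic, restart, scale)."),
    ("Recovery verification", "Which metrics must return to baseline?"),
    ("Escalation", "When should on-call escalate, and to whom?"),
    ("Post-mortem", "What to capture, and by what deadline?"),
]


def render_skeleton(incident_class: str) -> str:
    """Return an empty Markdown runbook for one incident class."""
    lines = [f"# Runbook: {incident_class}", ""]
    for title, prompt in SECTIONS:
        # Each section starts as a heading plus a comment to be replaced.
        lines += [f"## {title}", "", f"<!-- {prompt} -->", ""]
    return "\n".join(lines)
```

Calling `render_skeleton("vLLM queue overflow / 503s")` returns text you can save as a new file in the repo and fill in section by section.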

Sections

Symptoms section — concrete:

  • Alert names that fire
  • User-facing manifestations
  • Distinguishing features vs other incident classes

Triage section — first 30 seconds:

  • Which Grafana dashboard URL
  • Which logs to check
  • Which signal confirms the incident class and which rules it out
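
One way to write down "confirms vs rules out" is as a pair of checks per incident class. The sketch below is a toy model of that idea: the metric names (`http_503_rate`, `queue_depth`, `gpu_temp_c`, `gpu_clock_ratio`) and every threshold are hypothetical, so substitute the signals your own dashboards expose.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class TriageRule:
    incident_class: str
    confirms: Callable[[Metrics], bool]   # signal that must be present
    rules_out: Callable[[Metrics], bool]  # signal that excludes this class


# Hypothetical metric names and thresholds, for illustration only.
RULES: List[TriageRule] = [
    TriageRule(
        "vLLM queue overflow / 503s",
        confirms=lambda m: m["http_503_rate"] > 0.05 and m["queue_depth"] > 100,
        rules_out=lambda m: m["gpu_clock_ratio"] < 0.8,
    ),
    TriageRule(
        "GPU thermal throttling",
        confirms=lambda m: m["gpu_temp_c"] > 85 and m["gpu_clock_ratio"] < 0.8,
        rules_out=lambda m: m["gpu_temp_c"] < 70,
    ),
]


def triage(metrics: Metrics) -> List[str]:
    """Return classes whose confirming signal holds and whose rule-out signal does not."""
    return [
        rule.incident_class
        for rule in RULES
        if rule.confirms(metrics) and not rule.rules_out(metrics)
    ]
```

In a real runbook the same information usually lives as prose next to a dashboard link; the point of the sketch is only that each class needs both a confirming and an excluding signal.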

Mitigation — fast actions:

  • Specific commands / button clicks
  • Order of operations
  • Expected outcome of each step
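
The three points above (specific action, order, expected outcome) map naturally onto an ordered checklist. The following is a minimal sketch, assuming each step carries a check that reports whether the expected outcome was observed. The `Step` and `run_steps` names and the example steps are hypothetical, and the checks are stubs standing in for real metric queries.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    action: str                # what the on-call engineer does
    expected: str              # outcome the runbook promises
    check: Callable[[], bool]  # True once the expected outcome is observed


def run_steps(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Walk the steps in order; stop at the first one whose check fails."""
    log: List[str] = []
    for number, step in enumerate(steps, start=1):
        ok = step.check()
        status = "ok" if ok else "FAILED"
        log.append(f"{number}. {step.action} -> expected: {step.expected} [{status}]")
        if not ok:
            return False, log
    return True, log


# Hypothetical steps for a queue-overflow runbook; the checks are stubs.
steps = [
    Step("Shift traffic to the fallback route", "503 rate drops", lambda: True),
    Step("Restart the inference server", "queue drains", lambda: False),
    Step("Scale out by one replica", "p99 latency recovers", lambda: True),
]
ok, log = run_steps(steps)
print("\n".join(log))
```

Stopping at the first failed check is a deliberate choice here: a step that did not produce its expected outcome is usually the signal to move on to diagnosis or escalation rather than keep applying fixes.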

Examples

Common AI runbooks:

  • vLLM queue overflow / 503s
  • GPU thermal throttling
  • p99 TTFT spike
  • Hosted-API fallback unreachable
  • Eval score regression detected on shadow traffic
  • Vector store query latency spike
  • OOM on vLLM startup
  • Cost-per-token regression
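
With several runbooks like these in a repo, a small lint check can catch one that is missing a section. This is a sketch under the assumption that each section is a Markdown heading named exactly as in the structure above; `missing_sections` is a hypothetical helper, not part of any existing tool.

```python
import re
from typing import List

REQUIRED_SECTIONS = [
    "Symptoms", "Triage", "Diagnosis", "Mitigation",
    "Recovery verification", "Escalation", "Post-mortem",
]

# Matches ATX-style Markdown headings such as "## Triage".
HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(markdown: str) -> List[str]:
    """Return required section names that have no matching heading."""
    headings = {m.group(1).strip().lower() for m in HEADING.finditer(markdown)}
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]
```

Run over every runbook file in CI, an empty result means the file at least has all seven sections; it says nothing about whether their content is still accurate.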

Verdict

Runbooks for the 8-12 common AI incident classes are essential. Each one takes roughly 30-60 minutes to write; the full set is a few days of focused work. Update a runbook every time an incident exposes a gap in it. The investment pays off the first time a 3am page resolves in 15 minutes instead of 90.
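
Keeping runbooks current is easier with a mechanical reminder. The sketch below assumes a convention that is not part of the template itself: each runbook contains a line of the form `Last reviewed: YYYY-MM-DD`. It flags any runbook with no such line, or one older than 90 days as an approximation of a quarterly review.

```python
import re
from datetime import date, timedelta
from typing import Dict, List, Optional

REVIEW_INTERVAL = timedelta(days=90)  # "quarterly", approximated as 90 days
STAMP = re.compile(r"^Last reviewed:[ \t]*(\d{4}-\d{2}-\d{2})[ \t]*$", re.MULTILINE)


def last_reviewed(markdown: str) -> Optional[date]:
    """Parse the assumed 'Last reviewed: YYYY-MM-DD' line, if present."""
    match = STAMP.search(markdown)
    return date.fromisoformat(match.group(1)) if match else None


def stale_runbooks(runbooks: Dict[str, str], today: date) -> List[str]:
    """Return names of runbooks never reviewed, or reviewed more than 90 days ago."""
    stale = []
    for name, text in sorted(runbooks.items()):
        reviewed = last_reviewed(text)
        if reviewed is None or today - reviewed > REVIEW_INTERVAL:
            stale.append(name)
    return stale
```

The `runbooks` argument maps a file name to its contents, so the same function works whether the text comes from disk or from a test fixture.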

Bottom line

Write one runbook per incident class. See on-call rotation.

gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
