
On-Call Runbook for an AI Inference Server: The 12 Most Common Incidents

What goes wrong on a production AI inference server, in priority order, and how to triage each one. The runbook we hand to new on-call engineers.

AI inference servers fail in roughly the same dozen ways. This is the runbook we use.

TL;DR

Most incidents trace back to one of: OOM under a load spike, a driver crash, queue-depth blowout, a downstream API timeout, a full disk, or thermal throttling. The fixes are well known.

Twelve common incidents

  1. p99 TTFT (time to first token) spike: queue depth too high. Reduce max-num-seqs or scale out.
  2. vLLM crashed: usually OOM. Check logs for CUDA out of memory. Lower gpu-memory-utilization (flag sketch after this list).
  3. vLLM not responding to health check: hung kernel. Restart via systemd (commands after this list). If it persists, suspect the driver.
  4. 500 errors at low rate: usually individual prompt issues. Check structured logs.
  5. Disk full: HF cache, vLLM logs, or Qdrant. Most often the HF cache (50+ GB); triage commands after this list.
  6. Thermal throttling: nvidia-smi shows clocks dropping. Check airflow and temperature (query after this list).
  7. Driver crashed: nvidia-smi can’t find GPU. Reboot. If persistent, downgrade driver.
  8. NCCL hang on multi-GPU: usually a NIC binding issue. Check NCCL_SOCKET_IFNAME (env sketch after this list).
  9. Cold start after deploy: send a synthetic warmup request before routing traffic (example after this list).
  10. Quality regression: model commit SHA changed. Pin everything.
  11. Auth wall hit: per-key rate limit or budget. Check LiteLLM key state.
  12. Cloudflare 524 timeout: a long-running request exceeded Cloudflare's 100-second default. Use streaming or a longer Cloudflare timeout (Enterprise plan).
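
For incidents 1 and 2, a minimal sketch of the vLLM launch flags that set queue depth and memory headroom. The model name and the values are illustrative starting points, not recommendations for every card:

    # Cap concurrent sequences (queue depth) and leave KV-cache headroom.
    # Model and values are examples; tune for your GPU.
    vllm serve meta-llama/Llama-3.1-8B-Instruct \
        --max-num-seqs 64 \
        --gpu-memory-utilization 0.85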
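
For incident 3, the restart-and-inspect pair, assuming vLLM runs under a systemd unit named vllm.service (adjust to your actual unit name):

    sudo systemctl restart vllm.service
    journalctl -u vllm.service -n 100 --no-pager   # look for stack traces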
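
For incident 5, a quick disk triage sketch. The cache path shown is the Hugging Face default; yours may differ:

    df -h                                # which filesystem is full?
    du -sh ~/.cache/huggingface/hub      # the usual suspect
    du -sh /var/log                      # runner-up: logs
    # huggingface-cli delete-cache       # interactive cleanup of old model snapshots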
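
For incident 6, the throttle check. Both commands are stock nvidia-smi:

    nvidia-smi --query-gpu=clocks.sm,temperature.gpu --format=csv
    nvidia-smi -q -d PERFORMANCE   # lists active throttle/slowdown reasons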
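
For incident 8, pin NCCL to the right interface and turn on logging to confirm the binding. eth0 is an example; find your NIC with ip link:

    export NCCL_SOCKET_IFNAME=eth0   # example NIC; use your actual interface
    export NCCL_DEBUG=INFO           # confirm the chosen interface in startup logs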
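
For incident 9, a synthetic warmup request against vLLM's OpenAI-compatible endpoint. The port and model name are examples:

    curl -s http://localhost:8000/v1/completions \
        -H 'Content-Type: application/json' \
        -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "warmup", "max_tokens": 1}'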

Triage flow

  1. Check Grafana — TTFT, queue depth, and GPU memory utilization over the last 30 minutes
  2. Check vLLM logs — last 100 lines, look for stack traces
  3. Check nvidia-smi — GPU reachable? Memory used? Throttling?
  4. Check disk: df -h (steps 2-4 condense to the commands below)
  5. If it's a hardware fault: file a ticket with the datacenter and fail over to the backup server
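
Steps 2-4 condense to three commands on the box itself, again assuming the vllm.service unit name from above:

    journalctl -u vllm.service -n 100 --no-pager   # step 2: stack traces?
    nvidia-smi                                      # step 3: GPU reachable, memory, clocks
    df -h                                           # step 4: disk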

Verdict

Most AI server incidents are well understood. Build the dashboard and runbook before launching, not during the first outage at 3 AM.

Bottom line

Operate it like any other production backend. The fact that GPUs are involved doesn't change the playbook. See our monitoring guide.
