AI inference servers fail in roughly the same dozen ways. This is the runbook we use.
Most incidents trace to: OOM under load spike, driver crash, queue depth blowout, downstream API timeout, disk full, or thermal throttle. The fixes are well-known.
Twelve common incidents
- p99 TTFT spike: queue depth too high. Reduce `max-num-seqs` or scale out.
- vLLM crashed: usually OOM. Check logs for `CUDA out of memory`. Lower `gpu-memory-utilization` (both flags in the launch sketch after this list).
- vLLM not responding to health check: hung kernel. Restart via systemd. If persistent, driver issue.
- 500 errors at low rate: usually individual prompt issues. Check structured logs.
- Disk full: HF cache, vLLM logs, or Qdrant. Most often HF cache (50+ GB); cleanup sketch below.
- Thermal throttling: `nvidia-smi` shows clock dropped. Check airflow / temperature.
- Driver crashed: `nvidia-smi` can't find GPU. Reboot. If persistent, downgrade driver.
- NCCL hang on multi-GPU: usually NIC binding issue. Check `NCCL_SOCKET_IFNAME` (env sketch below).
- Cold start after deploy: send synthetic warmup request before traffic (curl sketch below).
- Quality regression: model commit SHA changed. Pin everything; see the `--revision` note in the launch sketch.
- Auth wall hit: per-key rate limit or budget. Check LiteLLM key state.
- Cloudflare 524 timeout: long-running request exceeded 100s. Use streaming or longer Cloudflare timeout (Enterprise plan); streaming sketch below.
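A minimal launch-line sketch covering the first two incidents. The model name and values are placeholders, not a production config; all three flags are standard `vllm serve` options:

```bash
# Sketch only: model name and values are placeholders, tune against your traffic.
# --gpu-memory-utilization: default 0.9; lower it when logs show CUDA OOM.
# --max-num-seqs: caps concurrent sequences, which bounds queue depth and TTFT.
# For incident 10, add --revision <commit-sha> to pin the HF repo to one commit.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 128
```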
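For disk-full incidents, a sketch of the usual cleanup; `huggingface-cli delete-cache` ships with `huggingface_hub` and assumes it is installed:

```bash
# Rank the usual suspects by size; the HF hub cache is almost always first.
du -sh ~/.cache/huggingface/hub/* 2>/dev/null | sort -rh | head

# Interactively remove unused model snapshots (part of huggingface_hub).
huggingface-cli delete-cache
```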
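For NCCL hangs, the check is whether NCCL bound to the right interface. A sketch, assuming the GPUs should talk over `eth0` (substitute your NIC):

```bash
# Force NCCL onto the intended NIC; binding to the wrong default interface
# (e.g. a docker0 bridge) is the classic cause of multi-GPU startup hangs.
export NCCL_SOCKET_IFNAME=eth0
# Log which interface NCCL actually picked, then restart the server.
export NCCL_DEBUG=INFO
```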
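For cold starts, a warmup sketch against vLLM's OpenAI-compatible API; port 8000 is vLLM's default bind, and the model name is a placeholder:

```bash
# Fire one tiny request so caches and CUDA graphs are hot before real traffic.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "prompt": "warmup", "max_tokens": 1}'
```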
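For the 524s, streaming gets the first bytes to Cloudflare well inside its 100-second window; same placeholder endpoint and model as the warmup sketch:

```bash
# -N disables curl's output buffering so tokens print as they arrive.
curl -sN http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "prompt": "long generation", "max_tokens": 2048, "stream": true}'
```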
Triage flow
- Check Grafana — look at TTFT, queue depth, GPU memory util in last 30 min
- Check vLLM logs — last 100 lines, look for stack traces
- Check `nvidia-smi` — GPU reachable? Memory used? Throttling?
- Check disk: `df -h` (steps 2–4 are bundled in the sketch after this list)
- If hardware fault: file ticket with datacenter, swap to backup server
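A sketch of steps 2–4 as one paste-able block; `vllm.service` and the mount point are assumptions about the deployment, not fixed names:

```bash
journalctl -u vllm.service -n 100 --no-pager   # step 2: last 100 lines of vLLM logs
nvidia-smi --query-gpu=memory.used,memory.total,temperature.gpu,clocks.sm \
  --format=csv                                 # step 3: reachable? memory? clocks down?
df -h /                                        # step 4: disk pressure
```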
Verdict
Most AI server incidents are well-understood. Build the dashboard and runbook before launching, not during the first outage at 3 AM.
Bottom line
Operate like any other production backend. The fact that GPUs are involved doesn't change the playbook. See the monitoring guide.