When the AI server breaks at 3 AM, the runbook matters more than the architecture.
The 30-minute triage:
1) Check Grafana dashboards (60s).
2) Check vLLM logs for stack traces (60s).
3) Run nvidia-smi for GPU health (30s).
4) Trigger the LiteLLM fallback to a hosted API while you diagnose (60s).
5) Restart vLLM if the cause is still unclear (~2 min of downtime).
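Step 4 assumes the fallback is already wired into LiteLLM, so requests fail over without hand-editing routes at 3 AM. A minimal sketch of what that proxy config might look like, written out as a shell step; the model names, file path, and exact keys (`fallbacks`, `os.environ/...`) are assumptions to check against your LiteLLM version:

```bash
# Sketch only: LiteLLM proxy with the local vLLM server as primary and a
# hosted API as fallback. Adapt names, paths, and keys to your deployment.
cat > /etc/litellm/config.yaml <<'EOF'
model_list:
  - model_name: local-llm            # primary: the self-hosted vLLM server
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3-8B-Instruct
      api_base: http://localhost:8000/v1
      api_key: "dummy"               # vLLM ignores the key, but the field is expected
  - model_name: hosted-fallback      # secondary: hosted API used while vLLM is down
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
litellm_settings:
  fallbacks: [{"local-llm": ["hosted-fallback"]}]
EOF

litellm --config /etc/litellm/config.yaml --port 4000
```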
Triage flow
- Grafana: TTFT, queue depth, GPU mem util in last 30 min
- vLLM logs: journalctl -u vllm -n 200 (the script after this list bundles these checks)
- nvidia-smi: GPU reachable? Memory use? Throttling?
- Disk: df -h
- If hardware fault → trigger fallback, file a ticket with the datacenter
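The log, GPU, and disk checks above collapse into one copy-paste block for the 3 AM case. A sketch, assuming vLLM runs as the systemd unit `vllm` (as in the journalctl command above) and serves its OpenAI-compatible API on localhost:8000:

```bash
#!/usr/bin/env bash
# Quick triage pass: run each check and eyeball the output.
set -u

echo "== vLLM service status and recent errors =="
systemctl status vllm --no-pager | head -n 5
journalctl -u vllm -n 200 --no-pager | grep -iE "error|traceback|cuda" | tail -n 20

echo "== GPU health: visibility, temperature, memory =="
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total \
           --format=csv || echo "GPU not reachable -- suspect driver or hardware"

echo "== Disk =="
df -h /

echo "== Is the server answering at all? (vLLM exposes /health) =="
curl -sf -o /dev/null -w "HTTP %{http_code}\n" http://localhost:8000/health \
  || echo "vLLM not responding on :8000"
```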
Common fixes
- Queue blowout → reduce traffic, scale up max-num-seqs cautiously (relaunch sketch after this list)
- OOM → restart vLLM with a lower gpu-memory-utilization (same sketch)
- Driver hung → reboot host (last resort)
- Cold-start latency → send a warmup request (example after this list)
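For the queue and OOM fixes, the knobs are vLLM engine flags. A sketch of a manual relaunch with adjusted values; in production you would put the same flags in the systemd unit and systemctl restart vllm. The model name and numbers are placeholders, not recommendations:

```bash
# Relaunch vLLM after an OOM or a queue blowout. Flag values are placeholders;
# tune them against your own workload and GPU.
#   --gpu-memory-utilization : lower it after OOM (vLLM's default is 0.90)
#   --max-num-seqs           : raise it cautiously if the queue keeps growing
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 512
```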
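For the cold-start fix, a warmup request simply forces the first slow forward pass before user traffic does. A sketch against vLLM's OpenAI-compatible endpoint on localhost:8000; the model name is a placeholder:

```bash
# Warm the server: one tiny completion so the first real request isn't the
# one paying for graph capture and cache warmup.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1
      }' > /dev/null && echo "warmup ok"
```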
Verdict
Most incidents resolve in 5-10 minutes with the right runbook. Build it before launch.
Bottom line
Practice the runbook quarterly. See the on-call runbook.