AI Deployment Incident Runbook: The First 30 Minutes

What to do in the first 30 minutes of an AI inference incident — diagnostic order, common fixes, and when to fall back to a hosted API.

Table of Contents

  1. Triage flow
  2. Common fixes
  3. Verdict

When the AI server breaks at 3 AM, the runbook matters more than the architecture.

TL;DR

30-min triage: 1) check Grafana dashboards (60s), 2) scan vLLM logs for stack traces (60s), 3) run nvidia-smi for GPU health (30s), 4) trigger the LiteLLM fallback to a hosted API while you keep diagnosing (60s), 5) restart vLLM if the cause is still unclear (~2 min of downtime).
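Step 4 assumes a LiteLLM proxy with a hosted fallback already configured. Before cutting traffic over, confirm the hosted path actually answers; a minimal smoke test, assuming an OpenAI-compatible provider and an API key in OPENAI_API_KEY (the endpoint and model name here are placeholders):

    #!/usr/bin/env bash
    # Smoke-test the hosted fallback before routing real traffic to it.
    # Substitute your provider's endpoint and model name.
    curl -sf https://api.openai.com/v1/chat/completions \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 5}' \
      && echo "fallback OK" || echo "fallback FAILED"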

Triage flow

  1. Grafana: TTFT, queue depth, GPU memory utilisation over the last 30 min
  2. vLLM logs: journalctl -u vllm -n 200 (steps 2-4 are sketched as commands after this list)
  3. nvidia-smi: GPU reachable? memory pressure? throttling?
  4. Disk: df -h
  5. If hardware fault → trigger fallback, file a ticket with the datacenter
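The log, GPU, and disk checks in steps 2-4 condense into a few commands. A minimal sketch, assuming vLLM runs as a systemd unit named vllm:

    #!/usr/bin/env bash
    # Triage steps 2-4 in one pass.

    # Step 2: last 200 log lines -- look for stack traces and CUDA errors
    journalctl -u vllm -n 200 --no-pager

    # Step 3: GPU health -- device reachable? memory headroom? throttling?
    nvidia-smi
    # -q -d PERFORMANCE includes the "Clocks Throttle Reasons" section
    nvidia-smi -q -d PERFORMANCE

    # Step 4: disk -- a full disk silently breaks logging and downloads
    df -h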

Common fixes

  • Queue blowout → reduce traffic first; scale up max-num-seqs cautiously
  • OOM → restart vLLM with a lower gpu-memory-utilization (see the sketch after this list)
  • Driver hung → reboot the host (last resort)
  • Cold-start latency → send a warmup request (also sketched below)
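For the OOM and cold-start bullets, a minimal restart-and-warm sequence, assuming vLLM runs as a systemd unit and serves its OpenAI-compatible API on port 8000 (the 0.85 figure and model name are illustrative, not recommendations):

    #!/usr/bin/env bash
    # OOM fix: restart vLLM with a lower GPU memory fraction.
    # Lower the value in the unit's ExecStart first, e.g.:
    #   vllm serve <model> --gpu-memory-utilization 0.85
    sudo systemctl restart vllm

    # Block until the server answers (loops forever on a failed start,
    # so watch the logs in another pane)
    until curl -sf http://localhost:8000/v1/models > /dev/null; do sleep 2; done

    # Cold-start fix: one warmup request so the first user doesn't pay
    # the compile/cache cost (model name is a placeholder)
    curl -s http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "my-model", "prompt": "warmup", "max_tokens": 1}'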

Verdict

Most incidents resolve in 5-10 minutes with the right runbook. Build it before launch.

Bottom line

Practice the runbook quarterly. See the on-call runbook.
