AI inference servers fail in roughly the same dozen ways. This is the runbook we use.
Most incidents trace to: OOM under load spike, driver crash, queue depth blowout, downstream API timeout, disk full, or thermal throttle. The fixes are well-known.
Twelve common incidents
- p99 TTFT spike: queue depth too high. Reduce `max-num-seqs` or scale out.
- vLLM crashed: usually OOM. Check logs for `CUDA out of memory`. Lower `gpu-memory-utilization` (both flags in the launch sketch after this list).
- vLLM not responding to health check: hung kernel. Restart via systemd. If persistent, driver issue.
- 500 errors at low rate: usually individual prompt issues. Check structured logs.
- Disk full: HF cache, vLLM logs, or Qdrant. Most often HF cache (50+ GB); cleanup sketch below.
- Thermal throttling: `nvidia-smi` shows clock dropped. Check airflow / temperature.
- Driver crashed: `nvidia-smi` can't find GPU. Reboot. If persistent, downgrade driver.
- NCCL hang on multi-GPU: usually NIC binding issue. Check `NCCL_SOCKET_IFNAME` (env sketch below).
- Cold start after deploy: send synthetic warmup request before traffic (curl sketch below).
- Quality regression: model commit SHA changed. Pin everything; see the `--revision` note in the launch sketch.
- Auth wall hit: per-key rate limit or budget. Check LiteLLM key state.
- Cloudflare 524 timeout: long-running request exceeded 100s. Use streaming or longer Cloudflare timeout (Enterprise plan); streaming sketch below.
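A minimal launch-line sketch covering the first two incidents. The model name and values are placeholders, not a production config; all three flags are standard `vllm serve` options:

```bash
# Sketch only: model name and values are placeholders, tune against your traffic.
# --gpu-memory-utilization: default 0.9; lower it when logs show CUDA OOM.
# --max-num-seqs: caps concurrent sequences, which bounds queue depth and TTFT.
# For incident 10, add --revision <commit-sha> to pin the HF repo to one commit.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 128
```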
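For disk-full incidents, a sketch of the usual cleanup; `huggingface-cli delete-cache` ships with `huggingface_hub` and assumes it is installed:

```bash
# Rank the usual suspects by size; the HF hub cache is almost always first.
du -sh ~/.cache/huggingface/hub/* 2>/dev/null | sort -rh | head

# Interactively remove unused model snapshots (part of huggingface_hub).
huggingface-cli delete-cache
```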
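For NCCL hangs, the check is whether NCCL bound to the right interface. A sketch, assuming the GPUs should talk over `eth0` (substitute your NIC):

```bash
# Force NCCL onto the intended NIC; binding to the wrong default interface
# (e.g. a docker0 bridge) is the classic cause of multi-GPU startup hangs.
export NCCL_SOCKET_IFNAME=eth0
# Log which interface NCCL actually picked, then restart the server.
export NCCL_DEBUG=INFO
```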
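For cold starts, a warmup sketch against vLLM's OpenAI-compatible API; port 8000 is vLLM's default bind, and the model name is a placeholder:

```bash
# Fire one tiny request so caches and CUDA graphs are hot before real traffic.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "prompt": "warmup", "max_tokens": 1}'
```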
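For the 524s, streaming gets the first bytes to Cloudflare well inside its 100-second window; same placeholder endpoint and model as the warmup sketch:

```bash
# -N disables curl's output buffering so tokens print as they arrive.
curl -sN http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "prompt": "long generation", "max_tokens": 2048, "stream": true}'
```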
Triage flow
- Check Grafana — look at TTFT, queue depth, GPU memory util in last 30 min
- Check vLLM logs — last 100 lines, look for stack traces
- Check `nvidia-smi` — GPU reachable? Memory used? Throttling?
- Check disk: `df -h` (steps 2–4 are bundled in the sketch after this list)
- If hardware fault: file ticket with datacenter, swap to backup server
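A sketch of steps 2–4 as one paste-able block; `vllm.service` and the mount point are assumptions about the deployment, not fixed names:

```bash
journalctl -u vllm.service -n 100 --no-pager   # step 2: last 100 lines of vLLM logs
nvidia-smi --query-gpu=memory.used,memory.total,temperature.gpu,clocks.sm \
  --format=csv                                 # step 3: reachable? memory? clocks down?
df -h /                                        # step 4: disk pressure
```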
Verdict
Most AI server incidents are well-understood. Build the dashboard and runbook before launching, not during the first outage at 3 AM.
Bottom line
Operate like any other production backend. The fact that GPUs are involved doesn't change the playbook. See the monitoring guide.