
AI On-Call Rotation

On-call practices for production AI — what alerts to wake people for, how to rotate, what runbooks to write.

Production AI on-call has its own patterns. The alert-able events differ from those of typical web services: generative output-quality regressions, model drift, and hosted-API fallback failures join the standard latency, error-rate, and capacity alerts.

TL;DR

Page on: GPU thermal or hardware faults, vLLM queue-depth spikes, p99 TTFT above SLO, hosted-API fallback failure, eval-score drops beyond threshold, and rising structured-output validation failure rates. Rotate weekly with a primary and a secondary. Give every alert a runbook: triage steps, mitigation, when to escalate, and how to verify recovery.

What to alert on

Alert-worthy (page someone):

  • GPU temp > 90°C sustained — hardware issue
  • p99 TTFT > 2× SLO for 5+ minutes — capacity or model issue
  • vLLM queue depth > 100 — capacity exhaustion
  • Error rate > 5% — service health
  • Hosted-API fallback unreachable — graceful degradation broken
  • Eval score drop > 5% on shadow traffic — quality regression
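The paging thresholds above can be sketched as a simple evaluation function. This is a minimal illustration, not a production alerting pipeline: metric names, the TTFT SLO value, and the alert labels are all assumptions, and the "sustained" / "5+ minutes" duration windows would normally live in your alert manager (e.g. a Prometheus `for:` clause) rather than in code like this.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    gpu_temp_c: float         # sustained GPU temperature
    p99_ttft_ms: float        # p99 time-to-first-token
    queue_depth: int          # vLLM pending-request queue depth
    error_rate: float         # fraction of failed requests
    fallback_reachable: bool  # hosted-API fallback health check
    eval_score_drop: float    # relative eval drop on shadow traffic

TTFT_SLO_MS = 500  # illustrative SLO; tune to your service

def page_reasons(m: Metrics) -> list[str]:
    """Return the page-worthy conditions currently firing."""
    reasons = []
    if m.gpu_temp_c > 90:
        reasons.append("gpu-thermal")
    if m.p99_ttft_ms > 2 * TTFT_SLO_MS:
        reasons.append("ttft-slo")
    if m.queue_depth > 100:
        reasons.append("queue-depth")
    if m.error_rate > 0.05:
        reasons.append("error-rate")
    if not m.fallback_reachable:
        reasons.append("fallback-down")
    if m.eval_score_drop > 0.05:
        reasons.append("eval-regression")
    return reasons
```

An empty return means no page; anything else wakes the primary with the firing reasons attached.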

Watch-worthy (dashboard, not page):

  • GPU temp 82-90°C
  • Cache hit rate dropping
  • Cost per token rising
  • User feedback "not helpful" rate increasing

Rotation

  • Weekly rotation with primary + secondary on-call
  • Hand off on Mondays with a state-of-the-system briefing
  • No more than one on-call week in four per engineer, for sustainability
  • Compensate appropriately (financial or time-off-in-lieu)
  • Junior engineers shadow before primary on-call
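A schedule that satisfies the points above can be generated mechanically. A minimal sketch, assuming a flat list of engineers: each week's secondary becomes the next week's primary (a common hand-off pattern, and my assumption here, not a requirement from the text), and with four or more engineers nobody is primary more than one week in four.

```python
from datetime import date, timedelta

def weekly_rotation(engineers: list[str], start: date, weeks: int):
    """Yield (week_start, primary, secondary) for each on-call week.

    The secondary is the next engineer in the list, so last week's
    secondary takes over as this week's primary.
    """
    n = len(engineers)
    for w in range(weeks):
        primary = engineers[w % n]
        secondary = engineers[(w + 1) % n]
        yield start + timedelta(weeks=w), primary, secondary
```

Shadowing fits naturally into the same structure: schedule a junior engineer as a third, non-paged slot for a few weeks before they enter the list as secondary.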

Runbooks

Each alert needs a runbook with:

  • Triage steps (which dashboards, which logs)
  • Mitigation actions (route traffic, restart service, scale)
  • When to escalate (timing, who to call)
  • How to verify recovery (which metrics to watch)
  • Post-incident actions
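One way to keep runbooks consistent is to treat the checklist above as a schema and make every runbook an instance of it. A minimal sketch; the field names, the example alert, and the specific mitigation steps are illustrative, not a prescribed format.

```python
from dataclasses import dataclass, field

@dataclass
class Runbook:
    alert: str
    triage: list[str]            # which dashboards, which logs
    mitigation: list[str]        # actions, in order of preference
    escalate_after_min: int      # minutes before escalating
    escalate_to: str             # who to call
    recovery_metrics: list[str]  # what must return to normal
    post_incident: list[str] = field(default_factory=list)

# Hypothetical example for the queue-depth alert.
queue_depth_runbook = Runbook(
    alert="vLLM queue depth > 100",
    triage=[
        "vLLM metrics dashboard (queue depth, running requests)",
        "request-rate graph: organic spike or retry storm?",
        "recent deploys and config changes",
    ],
    mitigation=[
        "route overflow traffic to the hosted-API fallback",
        "scale out serving replicas",
        "restart a stuck replica if queue is not draining",
    ],
    escalate_after_min=15,
    escalate_to="infra lead",
    recovery_metrics=[
        "queue depth back under 20 for 10 minutes",
        "p99 TTFT back under SLO",
    ],
    post_incident=["file incident report", "update this runbook"],
)
```

Keeping runbooks as structured data also makes it easy to lint for gaps, e.g. fail CI if any paging alert lacks a runbook or any runbook has an empty mitigation list.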

Verdict

On-call for production AI follows standard SRE patterns plus AI-specific extensions (eval drift, generative-quality regressions). Build the runbooks before you need them; rotate fairly; learn from every page. The first incident without a runbook costs more than writing twenty runbooks.

Bottom line

Standard SRE practices + AI-specific alerts. See incident response.
