Production AI on-call follows specific patterns. The alertable events differ from those of typical web services: generative output-quality regressions, model drift, and hosted-API fallback failures join the standard latency, error-rate, and capacity alerts.
On-call alerts that matter: GPU thermal / hardware faults, vLLM queue-depth spikes, p99 TTFT (time to first token) above SLO, hosted-API fallback failure, eval score drop beyond threshold, and structured-output validation failure rate. Rotate weekly with a primary and a secondary. Write a runbook for each alert: triage steps, mitigation, when to escalate, and how to verify recovery.
What to alert on
Alert-worthy (page someone):
- GPU temp > 90°C sustained — hardware issue
- p99 TTFT > 2× SLO for 5+ minutes — capacity or model issue
- vLLM queue depth > 100 — capacity exhaustion
- Error rate > 5% — service health
- Hosted-API fallback unreachable — graceful degradation broken
- Eval score drop > 5% on shadow traffic — quality regression
Watch-worthy (dashboard, not page); a threshold sketch covering both tiers follows this list:
- GPU temp 82-90°C
- Cache hit rate dropping
- Cost per token rising
- User feedback "not helpful" rate increasing
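To make the page/watch split concrete, here is a minimal sketch that encodes both tiers as data, so thresholds live in one reviewable place. The Rule and Severity types, metric names, and sustain windows are assumptions for illustration, not any particular monitoring stack's API; a real deployment would express these as alerting rules in its metrics system.

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    PAGE = "page"    # wake someone up
    WATCH = "watch"  # dashboard and ticket, no page

@dataclass
class Rule:
    metric: str
    threshold: float
    sustain_minutes: int  # breach must hold this long before firing
    severity: Severity

# Hypothetical metric names; thresholds mirror the lists above.
RULES = [
    Rule("gpu_temp_celsius", 90.0, 5, Severity.PAGE),
    Rule("p99_ttft_over_slo_ratio", 2.0, 5, Severity.PAGE),
    Rule("vllm_queue_depth", 100.0, 1, Severity.PAGE),
    Rule("error_rate", 0.05, 5, Severity.PAGE),
    Rule("fallback_unreachable", 1.0, 1, Severity.PAGE),
    Rule("shadow_eval_score_drop", 0.05, 30, Severity.PAGE),
    Rule("gpu_temp_celsius", 82.0, 15, Severity.WATCH),
    Rule("cache_hit_rate_drop", 0.10, 30, Severity.WATCH),
    Rule("cost_per_token_increase", 0.20, 60, Severity.WATCH),
    Rule("not_helpful_feedback_rate", 0.10, 60, Severity.WATCH),
]

def evaluate(metric: str, value: float) -> Severity | None:
    """Return the most severe tier this reading breaches, or None.

    Sustain windows would be enforced by the metrics pipeline; this
    sketch only checks the instantaneous value.
    """
    breached = [r for r in RULES if r.metric == metric and value >= r.threshold]
    if not breached:
        return None
    return min(breached, key=lambda r: list(Severity).index(r.severity)).severity
```

For example, `evaluate("gpu_temp_celsius", 86.0)` returns `Severity.WATCH`, while the same metric at 95 returns `Severity.PAGE`.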
Rotation
- Weekly rotation with primary + secondary on-call
- Hand off on Mondays with a state-of-the-system briefing
- Maximum one primary week in four, for sustainability (a schedule sketch follows this list)
- Compensate appropriately (financial or time-off-in-lieu)
- Junior engineers shadow a rotation before taking primary on-call
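As a sketch only: the one-in-four rule falls out of a simple round-robin once the pool has at least four engineers. The function and names below are hypothetical, not a scheduling tool's API.

```python
def rotation(engineers: list[str], weeks: int) -> list[tuple[str, str]]:
    """Weekly (primary, secondary) pairs via round-robin.

    With four or more engineers, each person is primary at most one
    week in four. Next week's primary serves as this week's secondary,
    so the Monday hand-off briefing goes to someone already warmed up.
    """
    if len(engineers) < 4:
        raise ValueError("need at least 4 engineers to honor the 1-in-4 limit")
    n = len(engineers)
    return [(engineers[w % n], engineers[(w + 1) % n]) for w in range(weeks)]

# Example: six weeks for a four-person team.
for week, (primary, secondary) in enumerate(rotation(["ana", "ben", "chi", "dev"], 6), 1):
    print(f"week {week}: primary={primary} secondary={secondary}")
```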
Runbooks
Each alert needs a runbook with the following (a minimal template sketch follows the list):
- Triage steps (which dashboards, which logs)
- Mitigation actions (route traffic, restart service, scale)
- When to escalate (timing, who to call)
- How to verify recovery (which metrics to watch)
- Post-incident actions
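One way to keep those five pieces from going missing is to store each runbook as structured data next to its alert definition. The schema and the example entry below are illustrative sketches, not a real alert's actual procedure.

```python
from dataclasses import dataclass, field

@dataclass
class Runbook:
    alert: str
    triage: list[str]            # dashboards and logs to check first
    mitigation: list[str]        # actions: reroute traffic, restart, scale
    escalate_after_minutes: int  # how long to try before calling for help
    escalate_to: str
    recovery_checks: list[str]   # metrics that must return to normal
    post_incident: list[str] = field(default_factory=list)

# Illustrative entry for the queue-depth page; every step is a placeholder.
VLLM_QUEUE_DEPTH = Runbook(
    alert="vllm_queue_depth > 100",
    triage=[
        "serving dashboard: queue depth, batch size, KV-cache usage",
        "vLLM logs for preemptions or OOM",
    ],
    mitigation=[
        "shift a share of traffic to the hosted-API fallback",
        "scale out inference replicas",
    ],
    escalate_after_minutes=15,
    escalate_to="inference-platform lead",
    recovery_checks=["queue depth < 20 for 10 min", "p99 TTFT back under SLO"],
    post_incident=["file the incident review", "update the capacity plan"],
)
```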
Verdict
On-call for production AI follows standard SRE patterns plus AI-specific extensions (eval drift, generative-quality regressions). Build the runbooks before you need them; rotate fairly; learn from every page. The first incident without a runbook costs more than writing twenty runbooks.
Bottom line
Standard SRE practices + AI-specific alerts. See incident response.