Most AI deployments monitor only the GPU temperature and call it observability. Real production needs application-level metrics that predict outages 5 minutes before they happen.
Three metric tiers: infrastructure (DCGM), inference engine (vLLM Prometheus), application (your structured logs). Alert on p99 TTFT, queue depth, cache hit rate, and GPU memory utilisation. Skip alerting on raw temperature unless you have a real cooling issue.
Metrics that matter
Infrastructure (DCGM exporter)
- DCGM_FI_DEV_GPU_UTIL — GPU compute utilisation (%)
- DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE — VRAM in use vs free
- DCGM_FI_DEV_POWER_USAGE — power draw (W)
- DCGM_FI_DEV_GPU_TEMP — GPU temperature (only useful as an alarm above 85°C)
- DCGM_FI_DEV_THROTTLE_REASONS — non-zero = problem
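Getting these into Prometheus is one scrape job. A minimal sketch, assuming dcgm-exporter is running on its default port 9400 and `gpu-node-1` is a placeholder hostname:

```yaml
scrape_configs:
  - job_name: dcgm
    scrape_interval: 15s        # GPU metrics change fast; 15s is a common default
    static_configs:
      - targets: ["gpu-node-1:9400"]   # dcgm-exporter's default listen port
```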
Inference engine (vLLM with --enable-metrics)
- vllm:num_requests_running — active sequences
- vllm:num_requests_waiting — queue depth (alert >100)
- vllm:gpu_cache_usage_perc — KV cache utilisation (alert >95%)
- vllm:gpu_prefix_cache_hit_rate_perc — prefix cache hit rate
- vllm:time_to_first_token_seconds — TTFT histogram
- vllm:time_per_output_token_seconds — TPOT histogram
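The two histograms are consumed via `histogram_quantile` in Grafana or alert rules. A sketch of the p99 TTFT query, assuming the standard `_bucket` suffix Prometheus appends to histogram series:

```promql
histogram_quantile(
  0.99,
  sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le)
)
```

Swap `0.99` for `0.5` or `0.95` to build the full latency row of the dashboard from the same series.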
Application
- Request latency p50/p95/p99
- Token cost per request (output tokens)
- Per-API-key request rate
- Error rate by error class (4xx, 5xx, vLLM-specific)
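The application tier is just structured logs your pipeline can aggregate. A minimal sketch of one JSON log line per request; the field names (`api_key`, `latency_s`, `output_tokens`, `error_class`) are illustrative, not a standard:

```python
import json
import time

def log_request(api_key: str, latency_s: float, output_tokens: int, status: int) -> str:
    """Emit one structured log line per request. A log backend (Loki,
    CloudWatch, etc.) can then derive latency percentiles, per-key
    request rates, token cost, and error rates by class."""
    record = {
        "ts": time.time(),
        "api_key": api_key,              # per-API-key request rate
        "latency_s": latency_s,          # feeds p50/p95/p99 latency
        "output_tokens": output_tokens,  # token cost per request
        # Bucket the status code into the error classes the alerts use
        "error_class": "5xx" if status >= 500 else "4xx" if status >= 400 else "ok",
    }
    return json.dumps(record)
```

Emitting one line per request keeps the aggregation logic out of the serving path; percentiles are computed downstream, not in the handler.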
Alerts that actually fire on real problems
- p99 TTFT > 2s for 5 minutes — usually means queue depth blew out
- vllm:num_requests_waiting > 100 for 2 minutes — incoming traffic exceeds capacity
- GPU memory utilisation > 95% for 5 minutes — about to OOM
- DCGM_FI_DEV_THROTTLE_REASONS != 0 — thermal or power throttling
- 5xx error rate > 1% — vLLM crashes or driver issues
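These translate almost one-to-one into Prometheus alerting rules. A sketch, assuming the DCGM and vLLM metric names above; alert names and thresholds are illustrative:

```yaml
groups:
  - name: llm-inference
    rules:
      - alert: HighTTFT          # p99 TTFT > 2s for 5 minutes
        expr: histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le)) > 2
        for: 5m
      - alert: QueueBacklog      # traffic exceeds capacity
        expr: vllm:num_requests_waiting > 100
        for: 2m
      - alert: GPUMemoryPressure # about to OOM
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 5m
      - alert: GPUThrottling     # thermal or power throttling
        expr: DCGM_FI_DEV_THROTTLE_REASONS != 0
```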
Dashboard layout
Three rows on the main dashboard:
- Top row: TTFT p99, queue depth, GPU mem util, error rate (large numbers, alert thresholds visible)
- Middle row: throughput tok/s aggregate, requests/sec, prefix cache hit rate, GPU util (timeseries)
- Bottom row: VRAM by component, request latency by p50/p95/p99 (timeseries)
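For the middle-row throughput panel, the aggregate token rate can be derived from vLLM's token counters. A sketch, assuming the `vllm:generation_tokens_total` counter exposed by recent vLLM versions:

```promql
sum(rate(vllm:generation_tokens_total[1m]))   # aggregate output tok/s
```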
Verdict
Run vLLM with --enable-metrics, scrape with Prometheus, dashboard with Grafana, alert on the four metrics above. Most outages telegraph themselves 5 minutes ahead in queue depth.
Bottom line
Build observability before you ship traffic. The boring infrastructure pays back the first time an LLM deployment misbehaves at 3 AM. See build a production AI inference server.