Most AI deployments under-monitor their GPUs. This is the practical setup we ship.
Three layers: nvidia-smi for ad-hoc checks, the DCGM exporter for GPU hardware metrics in Prometheus, and vLLM's own Prometheus endpoint for inference-engine metrics. Alert on p99 TTFT, queue depth, and GPU memory utilisation above 95%.
Tools
- nvidia-smi: built-in CLI, manual checks
- nvitop: htop-style live view
- DCGM exporter: NVIDIA's Prometheus exporter for GPU hardware metrics; basic launch is docker run -d --gpus all nvcr.io/nvidia/k8s/dcgm-exporter (fuller example after this list)
- vLLM --enable-metrics: exposes /metrics in Prometheus format
- Grafana: dashboards on top of Prometheus
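To make the first two layers concrete, here is a minimal sketch of an ad-hoc nvidia-smi check and of wiring the DCGM exporter into Prometheus. The port mapping (9400 is the exporter's default), the scrape job name, the host name gpu-node-1, and the prometheus.yml path are placeholders for illustration; pin the image tag to one that matches your driver and DCGM version.

```bash
# Ad-hoc check (layer 1): poll utilisation, VRAM, temperature, and power once a second.
nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu,power.draw \
  --format=csv -l 1

# Layer 2: run the DCGM exporter; it listens on port 9400 by default.
# Pin an image tag that matches your driver/DCGM version instead of relying on :latest.
docker run -d --restart unless-stopped --gpus all -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter

# Smoke test: the exporter serves Prometheus text format on /metrics.
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL

# Minimal scrape job (goes under scrape_configs in prometheus.yml);
# gpu-node-1 is a placeholder host name.
cat >> prometheus.yml <<'EOF'
  - job_name: dcgm
    static_configs:
      - targets: ['gpu-node-1:9400']
EOF
```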
Metrics that matter
- DCGM_FI_DEV_GPU_UTIL: GPU compute utilisation %
- DCGM_FI_DEV_FB_USED: VRAM used (most important for AI)
- DCGM_FI_DEV_POWER_USAGE: power draw W
- DCGM_FI_DEV_GPU_TEMP: temperature (alarm at >85°C)
- DCGM_FI_DEV_CLOCK_THROTTLE_REASONS: throttle-reason bitmask; non-zero = problem
- vllm:num_requests_waiting: queue depth (alert >100)
- vllm:gpu_cache_usage_perc: KV cache util (alert >95%)
- vllm:time_to_first_token_seconds: TTFT (alert p99 >2s)
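A sketch of PromQL for the metrics above, shown as queries against the Prometheus HTTP API. The Prometheus address and the 5-minute windows are assumptions; note that vLLM exports TTFT as a histogram, so the p99 comes from its _bucket series, and the KV-cache gauge is a fraction rather than a percentage.

```bash
# Placeholder Prometheus address; point this at your server.
PROM=http://prometheus:9090/api/v1/query
q() { curl -sG "$PROM" --data-urlencode "query=$1"; echo; }

# p99 TTFT over the last 5 minutes, from the vLLM histogram buckets.
q 'histogram_quantile(0.99, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))'

# Queue depth: requests waiting to be scheduled.
q 'vllm:num_requests_waiting'

# KV-cache usage; reported as a fraction, so 0.95 here means 95%.
q 'vllm:gpu_cache_usage_perc'

# GPU memory utilisation derived from the DCGM framebuffer gauges.
q 'DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)'
```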
Alerts
- p99 TTFT > 2s for 5 min — queue blowout
- GPU memory util > 95% for 5 min — about to OOM
- Throttle reasons != 0 — thermal or power issue
- 5xx error rate > 1% — vLLM crashes
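These translate fairly directly into Prometheus alerting rules. A minimal sketch, assuming the rule file lives at gpu-alerts.yml and using the thresholds from the list above; the alert names and severities are placeholders, and the 5xx rule is left out because it depends on which gateway or ingress exports HTTP status metrics.

```bash
cat > gpu-alerts.yml <<'EOF'
groups:
  - name: gpu-inference
    rules:
      # Queue blowout: p99 time-to-first-token above 2s for 5 minutes.
      - alert: HighTTFT
        expr: histogram_quantile(0.99, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m]))) > 2
        for: 5m
        labels: {severity: page}

      # About to OOM: framebuffer usage above 95% for 5 minutes.
      - alert: GPUMemoryNearFull
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 5m
        labels: {severity: page}

      # Thermal or power issue: any non-zero throttle-reason bit.
      - alert: GPUThrottled
        expr: DCGM_FI_DEV_CLOCK_THROTTLE_REASONS != 0
        labels: {severity: warn}
EOF
```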
Bottom line
Three-layer monitoring: nvidia-smi + DCGM + vLLM. See the full monitoring guide.