Nvidia Data Center GPU Manager (DCGM) exposes deep GPU telemetry – power draw, temperature, memory bandwidth, utilisation, error counts. The dcgm-exporter wraps this in Prometheus format. On our dedicated GPU hosting it is the right foundation for production GPU observability.
Install
Docker is the simplest path:
docker run -d --gpus all --rm -p 9400:9400 \
--name dcgm-exporter \
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
Point Prometheus at http://host:9400/metrics. All metric names carry the DCGM_FI_ prefix.
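On the Prometheus side, a scrape job for the exporter is all that is needed. A minimal fragment; the job name and target hostname are placeholders for your own environment:

```yaml
# prometheus.yml fragment – adjust job_name and targets to taste
scrape_configs:
  - job_name: dcgm
    scrape_interval: 15s
    static_configs:
      - targets: ["gpu-host-1:9400"]
```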
Metrics
The ones that matter for AI workloads:
- DCGM_FI_DEV_GPU_UTIL – GPU utilisation percentage
- DCGM_FI_DEV_MEM_COPY_UTIL – memory bandwidth utilisation
- DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE – VRAM used and free
- DCGM_FI_DEV_GPU_TEMP – temperature in °C
- DCGM_FI_DEV_POWER_USAGE – power draw in watts
- DCGM_FI_DEV_XID_ERRORS – hardware errors (should be zero)
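The exporter serves these in the standard Prometheus text exposition format, so a quick sanity check is easy to script. A minimal sketch; the sample payload below is illustrative, not real exporter output, and in practice you would fetch http://host:9400/metrics instead:

```python
# Parse Prometheus text exposition from dcgm-exporter into {metric: value}.
# Sample payload is hard-coded for illustration only.
sample = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-abc"} 61
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-abc"} 231.40
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-abc"} 0
"""

def parse_metrics(text):
    """Return {metric_name: float_value}, skipping comments and labels."""
    out = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # strip the {label="..."} set
        out[name] = float(value)
    return out

metrics = parse_metrics(sample)
assert metrics["DCGM_FI_DEV_XID_ERRORS"] == 0.0  # healthy card
print(metrics["DCGM_FI_DEV_GPU_TEMP"])  # 61.0
```

Note this naive parser keeps only one value per metric name; with multiple GPUs you would key on the `gpu` label as well.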
Dashboards
Nvidia publishes a reference Grafana dashboard (ID 12239). Import it and point at your Prometheus data source. For custom dashboards focus on:
- GPU utilisation heatmap across cards
- VRAM used percentage with threshold lines
- Power and temp over time
- Any non-zero XID errors (critical)
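The panels above map to short PromQL expressions. These are sketches, not panel exports; `$__range` is Grafana's built-in dashboard time-range variable:

```promql
# VRAM used percentage per card (add a threshold line at 95 in Grafana)
100 * DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)

# Utilisation heatmap: one series per card
DCGM_FI_DEV_GPU_UTIL

# XID errors accumulated over the dashboard window – anything > 0 is critical
increase(DCGM_FI_DEV_XID_ERRORS[$__range])
```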
Alerts
groups:
  - name: gpu-health
    rules:
      - alert: GPUOverTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} above 85°C for 5 minutes"
      - alert: GPUXIDError
        expr: increase(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "XID errors on GPU {{ $labels.gpu }}"
      - alert: GPUMemoryNearFull
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} VRAM above 95% for 10 minutes"
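Since the VRAM fraction appears in both dashboards and alerts, a recording rule keeps the two consistent and makes the alert expression cheaper to evaluate. A sketch; the rule name is our own convention, not anything dcgm-exporter defines:

```yaml
# Evaluated on Prometheus's rule interval; alert on gpu:fb_used:ratio > 0.95
groups:
  - name: gpu-recording
    rules:
      - record: gpu:fb_used:ratio
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
```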
Observable GPU Hosting
DCGM Exporter preconfigured on UK dedicated servers with a sample Grafana stack.
Browse GPU Servers. See also Prometheus + Grafana GPU monitoring and temperature monitoring.