DCGM Exporter for GPU Metrics on a Dedicated Server

NVIDIA DCGM Exporter emits GPU metrics in Prometheus format. Running it on a dedicated server gives you continuous visibility into GPU health and utilisation.

NVIDIA Data Center GPU Manager (DCGM) exposes detailed GPU telemetry: power draw, temperature, memory bandwidth, utilisation and error codes. dcgm-exporter publishes that telemetry over HTTP in Prometheus format. On our dedicated GPU hosting it is a solid foundation for production GPU observability.

Install

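Docker is the simplest path. The host needs the NVIDIA driver and the NVIDIA Container Toolkit installed so that --gpus all works: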

docker run -d --gpus all --rm -p 9400:9400 \
  --name dcgm-exporter \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04

Add http://host:9400/metrics as a Prometheus scrape target, replacing host with the server's address. Every GPU metric name starts with DCGM_FI_.
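Before wiring up Prometheus, it can help to confirm that the exporter is actually serving GPU samples. The following is a minimal sketch, not part of dcgm-exporter itself: the function names parse_dcgm_metrics and fetch_dcgm_metrics are hypothetical, the default URL assumes the exporter runs on the same machine, and the parser handles only the simple sample lines the exporter is expected to emit.

```python
import re
import urllib.request

# One exposition-format sample line: metric name, optional {labels}, numeric value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# One label pair inside the braces, e.g. gpu="0".
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_dcgm_metrics(text):
    """Return (name, labels, value) tuples for every DCGM_FI_* sample in text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None or not match.group("name").startswith("DCGM_FI_"):
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def fetch_dcgm_metrics(url="http://localhost:9400/metrics", timeout=5.0):
    """Fetch the exporter endpoint (assumed URL) and parse its DCGM samples."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_dcgm_metrics(response.read().decode("utf-8"))
```

Calling fetch_dcgm_metrics() on the GPU host should return one tuple per metric per GPU. An empty list suggests the exporter is up but cannot see any GPUs.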

Metrics

The ones that matter for AI workloads:

  • DCGM_FI_DEV_GPU_UTIL – GPU utilisation percentage
  • DCGM_FI_DEV_MEM_COPY_UTIL – percentage of time GPU memory was being read or written, a proxy for memory bandwidth utilisation
  • DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE – VRAM used and free, in MiB
  • DCGM_FI_DEV_GPU_TEMP – temperature in °C
  • DCGM_FI_DEV_POWER_USAGE – power draw in watts
  • DCGM_FI_DEV_XID_ERRORS – code of the most recent XID hardware error (should be zero)

Dashboards

NVIDIA publishes a reference Grafana dashboard (ID 12239). Import it and point it at your Prometheus data source. For custom dashboards focus on:

  • GPU utilisation heatmap across cards
  • VRAM used percentage with threshold lines
  • Power draw and temperature over time
  • Any non-zero XID error value (critical)
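The panels above map onto short PromQL expressions. As a rough sketch, the snippet below collects one candidate query per panel and builds a URL for Prometheus's instant-query endpoint (/api/v1/query) so each expression can be tried before it goes into Grafana. PANEL_QUERIES and instant_query_url are hypothetical names, and the base URL in the usage note assumes Prometheus listens on its default port 9090.

```python
from urllib.parse import urlencode

# Candidate PromQL per custom panel, built only from the metrics listed earlier.
PANEL_QUERIES = {
    "gpu_utilisation_percent": "DCGM_FI_DEV_GPU_UTIL",
    "vram_used_percent": (
        "100 * DCGM_FI_DEV_FB_USED"
        " / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)"
    ),
    "power_watts": "DCGM_FI_DEV_POWER_USAGE",
    "temperature_celsius": "DCGM_FI_DEV_GPU_TEMP",
    "xid_nonzero": "DCGM_FI_DEV_XID_ERRORS > 0",
}


def instant_query_url(base_url, promql):
    """Build a Prometheus instant-query URL for one PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"
```

For example, instant_query_url("http://localhost:9090", PANEL_QUERIES["vram_used_percent"]) yields a URL you can open in a browser or pass to curl. The same expressions can be pasted directly into a Grafana panel's query editor.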

Alerts

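groups:
  - name: gpu-health
    rules:
      # Core temperature above 85 °C for five minutes.
      - alert: GPUOverTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
      # The XID metric is a gauge holding the last error code, not a counter,
      # so compare it directly instead of using increase().
      - alert: GPUXIDError
        expr: DCGM_FI_DEV_XID_ERRORS > 0
      # More than 95% of VRAM in use for ten minutes.
      - alert: GPUMemoryNearFull
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 10m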
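To sanity-check the thresholds before loading the rules into Prometheus, the same conditions can be mirrored in a few lines of Python. This is only an illustrative sketch: GpuReading and breached_rules are hypothetical names, any non-zero XID value is treated as a breach, and the for: durations are ignored, so it evaluates a single snapshot rather than a sustained condition.

```python
from dataclasses import dataclass


@dataclass
class GpuReading:
    """One snapshot of the DCGM values the alert rules look at."""
    gpu: str
    temp_c: float        # DCGM_FI_DEV_GPU_TEMP
    xid: float           # DCGM_FI_DEV_XID_ERRORS
    fb_used_mib: float   # DCGM_FI_DEV_FB_USED
    fb_free_mib: float   # DCGM_FI_DEV_FB_FREE


def breached_rules(reading, max_temp_c=85.0, max_vram_fraction=0.95):
    """Return the names of the alert rules this snapshot would breach."""
    breached = []
    if reading.temp_c > max_temp_c:
        breached.append("GPUOverTemperature")
    if reading.xid > 0:
        breached.append("GPUXIDError")
    total_mib = reading.fb_used_mib + reading.fb_free_mib
    # Guard against a zero total so a missing reading does not divide by zero.
    if total_mib > 0 and reading.fb_used_mib / total_mib > max_vram_fraction:
        breached.append("GPUMemoryNearFull")
    return breached
```

A reading of 7800 MiB used and 200 MiB free gives a VRAM fraction of 0.975, which crosses the 0.95 threshold. Adjust the defaults to match the limits of your own cards.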

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
