Nvidia Data Center GPU Manager (DCGM) exposes deep GPU telemetry – power draw, temperature, memory bandwidth, utilisation, error counts. The dcgm-exporter wraps this in Prometheus format. On our dedicated GPU hosting it is the right foundation for production GPU observability.
Install
Docker is the simplest path:
docker run -d --gpus all --rm -p 9400:9400 \
--name dcgm-exporter \
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
Point Prometheus at http://host:9400/metrics. All metric names carry the DCGM_FI_ prefix.
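On the Prometheus side, a scrape job for the exporter is all that is needed. A minimal fragment; the job name and target hostname are placeholders for your own environment:

```yaml
# prometheus.yml fragment – adjust job_name and targets to taste
scrape_configs:
  - job_name: dcgm
    scrape_interval: 15s
    static_configs:
      - targets: ["gpu-host-1:9400"]
```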
Metrics
The ones that matter for AI workloads:
- DCGM_FI_DEV_GPU_UTIL – GPU utilisation percentage
- DCGM_FI_DEV_MEM_COPY_UTIL – memory bandwidth utilisation
- DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE – VRAM used and free
- DCGM_FI_DEV_GPU_TEMP – temperature in °C
- DCGM_FI_DEV_POWER_USAGE – power draw in watts
- DCGM_FI_DEV_XID_ERRORS – hardware errors (should be zero)
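The exporter serves these in the standard Prometheus text exposition format, so a quick sanity check is easy to script. A minimal sketch; the sample payload below is illustrative, not real exporter output, and in practice you would fetch http://host:9400/metrics instead:

```python
# Parse Prometheus text exposition from dcgm-exporter into {metric: value}.
# Sample payload is hard-coded for illustration only.
sample = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-abc"} 61
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-abc"} 231.40
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-abc"} 0
"""

def parse_metrics(text):
    """Return {metric_name: float_value}, skipping comments and labels."""
    out = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # strip the {label="..."} set
        out[name] = float(value)
    return out

metrics = parse_metrics(sample)
assert metrics["DCGM_FI_DEV_XID_ERRORS"] == 0.0  # healthy card
print(metrics["DCGM_FI_DEV_GPU_TEMP"])  # 61.0
```

Note this naive parser keeps only one value per metric name; with multiple GPUs you would key on the `gpu` label as well.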
Dashboards
Nvidia publishes a reference Grafana dashboard (ID 12239). Import it and point at your Prometheus data source. For custom dashboards focus on:
- GPU utilisation heatmap across cards
- VRAM used percentage with threshold lines
- Power and temp over time
- Any non-zero XID errors (critical)
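The panels above map to short PromQL expressions. These are sketches, not panel exports; `$__range` is Grafana's built-in dashboard time-range variable:

```promql
# VRAM used percentage per card (add a threshold line at 95 in Grafana)
100 * DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)

# Utilisation heatmap: one series per card
DCGM_FI_DEV_GPU_UTIL

# XID errors accumulated over the dashboard window – anything > 0 is critical
increase(DCGM_FI_DEV_XID_ERRORS[$__range])
```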
Alerts
groups:
  - name: gpu-health
    rules:
      - alert: GPUOverTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} above 85°C for 5 minutes"
      - alert: GPUXIDError
        expr: increase(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "XID errors on GPU {{ $labels.gpu }}"
      - alert: GPUMemoryNearFull
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} VRAM above 95% for 10 minutes"
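Since the VRAM fraction appears in both dashboards and alerts, a recording rule keeps the two consistent and makes the alert expression cheaper to evaluate. A sketch; the rule name is our own convention, not anything dcgm-exporter defines:

```yaml
# Evaluated on Prometheus's rule interval; alert on gpu:fb_used:ratio > 0.95
groups:
  - name: gpu-recording
    rules:
      - record: gpu:fb_used:ratio
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
```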
Observable GPU Hosting
DCGM Exporter preconfigured on UK dedicated servers with a sample Grafana stack.
Browse GPU Servers. See also Prometheus + Grafana GPU monitoring and temperature monitoring.