Keeping an eye on GPU utilisation is critical when running AI workloads on a dedicated GPU server. Underutilised GPUs waste money, while overloaded ones throttle and degrade inference latency. This tutorial covers monitoring approaches from simple CLI tools to full Prometheus and Grafana dashboards, so you can track every metric that matters on your LLM hosting infrastructure.
nvidia-smi: Quick GPU Monitoring
The fastest way to check GPU status is nvidia-smi, which ships with the NVIDIA driver. If you need to install or update your drivers, see our CUDA installation guide.
# One-shot GPU status
nvidia-smi
# Continuous monitoring every 1 second
nvidia-smi -l 1
# Structured CSV output for scripting
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total,power.draw \
--format=csv -l 1
# Monitor specific GPU processes
nvidia-smi pmon -i 0 -s u -d 1
# Query per-process memory usage
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv
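The `csv,noheader,nounits` output format is easy to consume from a script. As a minimal sketch (the `parse_gpu_csv` helper and the sample line are illustrative, not part of nvidia-smi):

```python
# Parse nvidia-smi --query-gpu CSV output into dicts.
# parse_gpu_csv and FIELDS are illustrative helpers; the field order must
# match the --query-gpu argument you pass to nvidia-smi.
import csv
import io

FIELDS = ["index", "name", "temperature.gpu", "utilization.gpu",
          "utilization.memory", "memory.used", "memory.total", "power.draw"]

def parse_gpu_csv(text):
    """Turn 'csv,noheader,nounits' output into a list of dicts, one per GPU."""
    rows = csv.reader(io.StringIO(text.strip()))
    return [dict(zip(FIELDS, (v.strip() for v in row))) for row in rows]

# Example line in the shape nvidia-smi emits (values are made up):
sample = "0, NVIDIA A100-SXM4-80GB, 41, 87, 55, 61234, 81920, 312.45\n"
gpus = parse_gpu_csv(sample)
print(gpus[0]["utilization.gpu"])  # "87"
```

Pair this with `subprocess.run` (as in the logging script later in this guide) to poll live values.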
For a persistent terminal monitor, use nvtop, which provides a top-like interface for GPUs:
# Install nvtop
sudo apt update && sudo apt install -y nvtop
# Launch the interactive monitor
nvtop
These tools are excellent for quick debugging but inadequate for production monitoring at scale. For that, you need a metrics pipeline. If you are running containerised workloads, nvidia-smi also works inside Docker GPU containers.
NVIDIA DCGM Exporter for Prometheus
NVIDIA Data Center GPU Manager (DCGM) exposes detailed GPU metrics in Prometheus format. This is especially important for multi-GPU cluster environments where tracking individual card health is critical. Install the DCGM exporter as a Docker container or systemd service:
# Run DCGM exporter as a Docker container
docker run -d --gpus all --rm \
-p 9400:9400 \
--name dcgm-exporter \
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
# Verify metrics are being exported
curl -s localhost:9400/metrics | head -20
# Check specific metrics
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_FB_USED
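The exporter speaks the standard Prometheus text format, so you can also read individual gauges directly from the endpoint. A minimal sketch of a line parser (the sample lines are illustrative; real output carries more labels):

```python
# Minimal parser for the Prometheus text exposition format served by the
# DCGM exporter on :9400/metrics. Illustrative helper, not a full parser.
def parse_metric(text, name):
    """Return a list of (labels_string, float_value) for one metric name."""
    results = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(name):  # skips "# HELP"/"# TYPE" lines too
            continue
        # Line shape: NAME{label="x",...} VALUE
        labels, _, value = line[len(name):].rpartition(" ")
        results.append((labels.strip(), float(value)))
    return results

sample = """# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-abc"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-def"} 3
"""
utils = parse_metric(sample, "DCGM_FI_DEV_GPU_UTIL")
print([v for _, v in utils])  # [87.0, 3.0]
```

In production you would let Prometheus do the scraping, but a parser like this is handy for one-off health checks.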
Alternatively, install DCGM natively:
# Install DCGM packages (from the NVIDIA CUDA apt repository; add it first if needed)
sudo apt install -y datacenter-gpu-manager
# Start DCGM service
sudo systemctl enable --now nvidia-dcgm
# Query GPU health
dcgmi discovery -l
dcgmi diag -r 1
Set Up Prometheus for GPU Metrics
Install Prometheus to scrape and store GPU metrics from DCGM exporter:
# Download and install Prometheus
cd /opt
sudo wget https://github.com/prometheus/prometheus/releases/download/v2.51.0/prometheus-2.51.0.linux-amd64.tar.gz
sudo tar xzf prometheus-2.51.0.linux-amd64.tar.gz
sudo mv prometheus-2.51.0.linux-amd64 prometheus
Configure Prometheus to scrape the DCGM exporter and (optionally) your vLLM metrics endpoint:
# /opt/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'dcgm-exporter'
    static_configs:
      - targets: ['localhost:9400']

  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: /metrics

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
Create a systemd service for Prometheus:
# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/opt/prometheus/prometheus \
    --config.file=/opt/prometheus/prometheus.yml \
    --storage.tsdb.path=/opt/prometheus/data \
    --storage.tsdb.retention.time=30d
Restart=always

[Install]
WantedBy=multi-user.target
# Create user and start service
sudo useradd --system --no-create-home prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
# Verify Prometheus is scraping
curl -s localhost:9090/api/v1/targets | python3 -m json.tool | head -30
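Once targets are up, you can also read GPU metrics back out of Prometheus' instant-query HTTP API from a script. A sketch, assuming Prometheus on localhost:9090 as configured above (the helper names are illustrative):

```python
# Query Prometheus' instant-query API (/api/v1/query) for GPU metrics.
# extract_values/query are illustrative helper names.
import json
import urllib.parse
import urllib.request

def extract_values(api_json):
    """Pull (labels, value) pairs out of an instant-query response."""
    return [(r["metric"], float(r["value"][1]))
            for r in api_json["data"]["result"]]

def query(expr, base="http://localhost:9090"):
    url = base + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        return extract_values(json.load(resp))

# Offline example of the response shape the API returns:
sample = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"gpu": "0"}, "value": [1715000000.0, "87"]}]}}
print(extract_values(sample))  # [({'gpu': '0'}, 87.0)]
```

For example, `query("DCGM_FI_DEV_GPU_UTIL")` returns current utilisation per GPU against a live server.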
If you are serving models with vLLM, the vLLM metrics endpoint provides token throughput and KV cache utilisation. See our vLLM memory optimisation guide for tuning based on these metrics.
Grafana GPU Dashboards
Install Grafana to visualise your GPU metrics:
# Install Grafana (apt-key is deprecated; use a signed-by keyring instead)
sudo apt install -y apt-transport-https software-properties-common
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana
# Start Grafana
sudo systemctl enable --now grafana-server
Access Grafana at http://your-server:3000 (default credentials: admin/admin). Add Prometheus as a data source, then import the NVIDIA DCGM dashboard:
# Add Prometheus as a data source via the API
curl -X POST http://admin:admin@localhost:3000/api/datasources \
-H "Content-Type: application/json" \
-d '{
"name": "Prometheus",
"type": "prometheus",
"url": "http://localhost:9090",
"access": "proxy",
"isDefault": true
}'
# Import the NVIDIA DCGM dashboard (ID: 12239)
# Easiest via the UI: Dashboards -> Import -> enter 12239 and select the
# Prometheus data source. To script it instead, download the dashboard JSON
# and wrap it in an import request (requires jq):
curl -s https://grafana.com/api/dashboards/12239/revisions/latest/download \
-o dcgm-dashboard.json
jq '{dashboard: ., overwrite: true, inputs: [{name: "DS_PROMETHEUS", type: "datasource", pluginId: "prometheus", value: "Prometheus"}]}' dcgm-dashboard.json | \
curl -X POST http://admin:admin@localhost:3000/api/dashboards/import \
-H "Content-Type: application/json" -d @-
Custom Monitoring Scripts
For lightweight monitoring without a full stack, use this Python script to log GPU stats. This is a good approach if you are hosting a private AI deployment and want simple observability without external services:
#!/usr/bin/env python3
# gpu_monitor.py — Log GPU stats to CSV
import subprocess
import csv
import time
from datetime import datetime

LOG_FILE = "/var/log/gpu_monitor.csv"
INTERVAL = 10  # seconds

def get_gpu_stats():
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,temperature.gpu,utilization.gpu,"
         "utilization.memory,memory.used,memory.total,power.draw,power.limit",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True
    )
    return result.stdout.strip().split("\n")

def main():
    with open(LOG_FILE, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            timestamp = datetime.now().isoformat()
            for line in get_gpu_stats():
                values = [v.strip() for v in line.split(",")]
                writer.writerow([timestamp] + values)
            f.flush()
            time.sleep(INTERVAL)

if __name__ == "__main__":
    main()
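The resulting log is plain CSV, so analysing it later is straightforward. A sketch that computes average utilisation per GPU from the logged rows (the helper names are illustrative; the column order matches the --query-gpu fields in the script above):

```python
# Summarise the CSV written by gpu_monitor.py: average utilisation per GPU.
# summarise/average_utilisation are illustrative helper names.
import csv
from collections import defaultdict

def summarise(rows):
    """rows: iterables shaped like the CSV lines gpu_monitor.py writes."""
    totals = defaultdict(list)
    for row in rows:
        # row = [timestamp, index, name, temp, util.gpu, util.mem,
        #        mem.used, mem.total, power.draw, power.limit]
        totals[row[1]].append(float(row[4]))
    return {gpu: sum(v) / len(v) for gpu, v in totals.items()}

def average_utilisation(path="/var/log/gpu_monitor.csv"):
    with open(path, newline="") as f:
        return summarise(csv.reader(f))
```

Running `average_utilisation()` over a day of samples quickly shows whether a card is sitting idle or saturated.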
# Run as a systemd service
sudo tee /etc/systemd/system/gpu-monitor.service > /dev/null << 'EOF'
[Unit]
Description=GPU Monitor Script
After=network.target
[Service]
Type=simple
ExecStart=/usr/bin/python3 /opt/gpu_monitor.py
Restart=always
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now gpu-monitor
Alerting on GPU Anomalies
Set up Prometheus alerting rules to get notified when GPU metrics cross thresholds. Alerting is essential for auto-scaling inference architectures, where scaling decisions depend on real-time GPU data. Pair alerting with your API security layer to detect abuse patterns:
# /opt/prometheus/alert_rules.yml
groups:
  - name: gpu_alerts
    rules:
      - alert: GPUHighTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} temperature above 85°C"

      - alert: GPUMemoryExhausted
        # FB_USED and FB_FREE together make up total framebuffer memory
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU {{ $labels.gpu }} VRAM usage above 95%"

      - alert: GPUUtilizationLow
        expr: DCGM_FI_DEV_GPU_UTIL < 10
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "GPU {{ $labels.gpu }} utilization below 10% for 15 minutes"
# Add alert rules to Prometheus config
# In /opt/prometheus/prometheus.yml, add:
rule_files:
  - "alert_rules.yml"
# Restart Prometheus
sudo systemctl restart prometheus
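One detail worth double-checking in the VRAM alert: DCGM reports framebuffer memory as two gauges, FB_USED and FB_FREE (both in MiB), which together make up the card's total memory. The utilisation fraction is therefore used / (used + free), not used / free. A quick sanity check:

```python
# VRAM utilisation fraction as used in the GPUMemoryExhausted alert.
# DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE are MiB gauges that sum to
# the total framebuffer size, so the fraction is used / (used + free).
def vram_fraction(fb_used_mib, fb_free_mib):
    return fb_used_mib / (fb_used_mib + fb_free_mib)

# 77824 MiB used of an 81920 MiB (80 GB) card:
print(round(vram_fraction(77824, 4096), 3))  # 0.95
```

The same arithmetic applies whether you evaluate it in PromQL or in a script.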
Proper monitoring helps you right-size your GPU allocation and identify when it is time to scale. For performance testing, see our GPU benchmarking guide. Compare performance across different hardware using the tokens per second benchmark. For cost analysis, check the cost per million tokens calculator. Browse all infrastructure guides in the AI hosting and infrastructure category.
Full Visibility Into Your GPU Infrastructure
GigaGPU dedicated servers include IPMI access and full root control for complete monitoring flexibility. Deploy Prometheus, Grafana, and DCGM on high-performance NVIDIA hardware.
Browse GPU Servers