Tutorials

Connect Grafana Cloud to GPU Server Metrics

Visualise GPU server metrics in Grafana Cloud for your self-hosted AI infrastructure. This guide covers Prometheus exporters, GPU metric collection, custom dashboards for inference monitoring, and connecting everything to Grafana Cloud's hosted platform.

What You’ll Connect

After this guide, your Grafana Cloud instance will display real-time dashboards for your GPU AI infrastructure — no self-managed monitoring stack required. GPU utilisation, memory, inference latency, and throughput metrics from your dedicated GPU servers flow into Grafana Cloud where you build custom visualisations and alert rules.

The integration uses the Grafana Agent on your GPU server to scrape Prometheus metrics from both the NVIDIA DCGM exporter and your vLLM or Ollama inference endpoint, then ships them to Grafana Cloud’s hosted Prometheus backend.

DCGM Exporter (GPU metrics) ──┐
                              ├──> Grafana Agent ───> Grafana Cloud ───> Dashboards + Alerts
vLLM /metrics (latency, RPS) ─┘    (scrapes both      (hosted            (custom panels for
                                    targets every      Prometheus         GPU AI monitoring)
                                    15 seconds)        + Grafana)

Prerequisites

  • A GigaGPU server with an LLM running on vLLM or Ollama (setup guide)
  • A Grafana Cloud account (free tier includes 10,000 metric series)
  • SSH access to your GPU server
  • Docker installed for running the DCGM exporter container

Integration Steps

Start the NVIDIA DCGM exporter on your GPU server as a Docker container. It exposes GPU metrics in Prometheus format on port 9400: utilisation, memory, temperature, power, ECC errors, and clock speeds. This is the standard approach for GPU observability in containerised environments.
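As a quick alternative to the Compose file later in this guide, the exporter can be launched directly with docker run. This is a sketch that assumes the NVIDIA Container Toolkit is already installed and reuses the image tag from the Compose example:

```shell
# Run the DCGM exporter as a standalone container
# (assumes the NVIDIA Container Toolkit provides --gpus support)
docker run -d --gpus all \
  --name dcgm-exporter \
  --restart unless-stopped \
  -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04

# Confirm metrics are exposed in Prometheus format
curl -s localhost:9400/metrics | head
```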

Install the Grafana Agent on your GPU server. Configure it to scrape two targets: the DCGM exporter (localhost:9400/metrics) and your vLLM metrics endpoint (localhost:8000/metrics). Set the remote_write URL to your Grafana Cloud Prometheus endpoint using the instance ID and API key from your Grafana Cloud portal.
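One way to install the static-mode Grafana Agent on a Debian or Ubuntu server is from Grafana's APT repository. This sketch assumes a systemd-based host; the package reads its configuration from /etc/grafana-agent.yaml by default, so adjust the path if you keep the config elsewhere:

```shell
# Add Grafana's APT repository and install the agent
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update && sudo apt-get install -y grafana-agent

# Start the agent and enable it on boot
sudo systemctl enable --now grafana-agent
</imports>
```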

In Grafana Cloud, import or build a dashboard that visualises GPU metrics alongside inference performance data. The DCGM exporter produces metrics prefixed with DCGM_FI_, while vLLM metrics use the vllm: prefix. Combine both in panel queries for a complete operational view.

Code Example

Grafana Agent configuration and Docker Compose for monitoring your AI inference server:

# docker-compose.monitoring.yml
services:
  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "9400:9400"
    restart: unless-stopped

# /etc/grafana-agent/agent.yaml
metrics:
  wal_directory: /tmp/grafana-agent-wal
  global:
    scrape_interval: 15s
    remote_write:
      - url: https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push
        basic_auth:
          username: "YOUR_GRAFANA_CLOUD_ID"
          password: "YOUR_GRAFANA_CLOUD_API_KEY"

  configs:
    - name: gpu_ai_metrics
      scrape_configs:
        - job_name: dcgm_exporter
          static_configs:
            - targets: ["localhost:9400"]
              labels:
                server: "inference-prod-1"
                gpu_type: "a100"

        - job_name: vllm_inference
          static_configs:
            - targets: ["localhost:8000"]
              labels:
                model: "llama-3-70b"
                server: "inference-prod-1"

# Example Grafana dashboard panel queries:
# GPU Utilisation:  DCGM_FI_DEV_GPU_UTIL{server="inference-prod-1"}
# GPU Memory Used:  DCGM_FI_DEV_FB_USED{server="inference-prod-1"}
# Inference Latency: histogram_quantile(0.95, rate(vllm:e2e_request_latency_seconds_bucket[5m]))
# Requests/sec:    rate(vllm:request_success_total[5m])

Testing Your Integration

Start the DCGM exporter and verify metrics are available: curl localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL. Start the Grafana Agent and check its logs for successful remote_write deliveries to Grafana Cloud. Navigate to Grafana Cloud’s Explore view and query DCGM_FI_DEV_GPU_UTIL — data points should appear within 30 seconds.
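The verification steps above can be run as a short sequence. This sketch assumes the agent was installed as a systemd service and that vLLM serves on port 8000:

```shell
# 1. DCGM exporter is serving GPU metrics
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL

# 2. Agent logs show remote_write deliveries to Grafana Cloud
journalctl -u grafana-agent --since "5 minutes ago" | grep -i remote_write

# 3. vLLM is exporting application metrics
curl -s localhost:8000/metrics | grep "^vllm:"
```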

Send inference requests to generate vLLM metrics, then query vllm:num_requests_running in Explore to verify application metrics arrive alongside the GPU hardware metrics.

Production Tips

Create alert rules in Grafana Cloud for critical thresholds: GPU temperature above 85 °C, memory utilisation above 95%, inference latency p95 above your SLA target, and the DCGM exporter going offline. Route alerts to Slack, PagerDuty, or email based on severity.
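The thresholds above could be expressed as Prometheus-style alerting rules. This is a sketch, not a production rule set: the 0.5 s latency target is a placeholder for your own SLA, and severities are examples.

```yaml
groups:
  - name: gpu_inference_alerts
    rules:
      - alert: GPUHighTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: critical
      - alert: GPUMemoryPressure
        # Fraction of framebuffer memory in use
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 10m
        labels:
          severity: warning
      - alert: InferenceLatencyHigh
        # p95 end-to-end latency; 0.5 s is an example SLA target
        expr: histogram_quantile(0.95, rate(vllm:e2e_request_latency_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
      - alert: DCGMExporterDown
        expr: up{job="dcgm_exporter"} == 0
        for: 2m
        labels:
          severity: critical
```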

Use Grafana’s dashboard variables to build a single dashboard that works across multiple GPU servers. A server selector dropdown lets you switch between instances without duplicating panels. This scales well as you add more GPU servers to your fleet.
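As a sketch of that pattern: define a dashboard variable named server (the name is an example) populated from label values already present in your metrics, then reference it in panel queries instead of a hard-coded server label.

```
# Variable query (Grafana templating):
#   label_values(DCGM_FI_DEV_GPU_UTIL, server)

# Panel queries using the variable:
DCGM_FI_DEV_GPU_UTIL{server="$server"}
histogram_quantile(0.95, rate(vllm:e2e_request_latency_seconds_bucket{server="$server"}[5m]))
```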

Grafana Cloud’s free tier is generous enough for small deployments. For larger fleets running open-source models, the paid tier adds features like SLO tracking and advanced alerting. Pair monitoring with our secure API guide, browse more tutorials, or get started with GigaGPU to build observable AI infrastructure.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
