What You’ll Connect
By the end of this guide, your Grafana Cloud instance will display real-time dashboards for your GPU AI infrastructure — no self-managed monitoring stack required. GPU utilisation, memory, inference latency, and throughput metrics from your dedicated GPU servers flow into Grafana Cloud, where you can build custom visualisations and alert rules.
The integration uses the Grafana Agent on your GPU server to scrape Prometheus metrics from both the NVIDIA DCGM exporter and your vLLM or Ollama inference endpoint, then ships them to Grafana Cloud’s hosted Prometheus backend.
DCGM Exporter (GPU metrics) ──┐
                              ├──> Grafana Agent ──> Grafana Cloud ──> Dashboards + Alerts
vLLM /metrics (latency, RPS) ─┘    scrapes metrics   (hosted           custom panels for
                                   every 15 seconds  Prometheus        GPU AI monitoring
                                                     + Grafana)

Prerequisites
- A GigaGPU server with an LLM running on vLLM or Ollama (setup guide)
- A Grafana Cloud account (free tier includes 10,000 metrics series)
- SSH access to your GPU server
- Docker installed for running the DCGM exporter container
Integration Steps
1. Start the NVIDIA DCGM exporter on your GPU server as a Docker container. It exposes GPU metrics in Prometheus format on port 9400: utilisation, memory, temperature, power, ECC errors, and clock speeds. This is the standard approach for GPU observability in containerised environments.
2. Install the Grafana Agent on your GPU server. Configure it to scrape two targets: the DCGM exporter (localhost:9400/metrics) and your vLLM metrics endpoint (localhost:8000/metrics). Set the remote_write URL to your Grafana Cloud Prometheus endpoint using the instance ID and API key from your Grafana Cloud portal.
3. In Grafana Cloud, import or build a dashboard that visualises GPU metrics alongside inference performance data. The DCGM exporter produces metrics prefixed with DCGM_FI_, while vLLM metrics use the vllm: prefix. Combine both in panel queries for a complete operational view.
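For example, one panel can chart hardware-level memory pressure next to the inference engine's own view of its KV cache. A sketch (DCGM_FI_DEV_FB_* are standard DCGM fields; vllm:gpu_cache_usage_perc is exposed by recent vLLM versions but may differ in yours):

```promql
# GPU memory pressure (%), computed from DCGM hardware counters
100 * DCGM_FI_DEV_FB_USED{server="inference-prod-1"}
  / (DCGM_FI_DEV_FB_USED{server="inference-prod-1"} + DCGM_FI_DEV_FB_FREE{server="inference-prod-1"})

# KV-cache usage as reported by the inference engine itself
vllm:gpu_cache_usage_perc{server="inference-prod-1"}
```

When the two series diverge — hardware memory nearly full while the KV cache is mostly idle — that usually points at model weights rather than request load.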
Code Example
Grafana Agent configuration and Docker Compose for monitoring your AI inference server:
# docker-compose.monitoring.yml
services:
  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    ports:
      - "9400:9400"
    restart: unless-stopped
# /etc/grafana-agent/agent.yaml
metrics:
  wal_directory: /tmp/grafana-agent-wal
  global:
    scrape_interval: 15s
    remote_write:
      - url: https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push
        basic_auth:
          username: "YOUR_GRAFANA_CLOUD_ID"
          password: "YOUR_GRAFANA_CLOUD_API_KEY"
  configs:
    - name: gpu_ai_metrics
      scrape_configs:
        - job_name: dcgm_exporter
          static_configs:
            - targets: ["localhost:9400"]
              labels:
                server: "inference-prod-1"
                gpu_type: "a100"
        - job_name: vllm_inference
          static_configs:
            - targets: ["localhost:8000"]
              labels:
                model: "llama-3-70b"
                server: "inference-prod-1"
# Example Grafana dashboard panel queries:
# GPU Utilisation: DCGM_FI_DEV_GPU_UTIL{server="inference-prod-1"}
# GPU Memory Used: DCGM_FI_DEV_FB_USED{server="inference-prod-1"}
# Inference Latency: histogram_quantile(0.95, rate(vllm:e2e_request_latency_seconds_bucket[5m]))
# Requests/sec: rate(vllm:request_success_total[5m])
Testing Your Integration
Start the DCGM exporter and verify metrics are available: curl localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL. Start the Grafana Agent and check its logs for successful remote_write deliveries to Grafana Cloud. Navigate to Grafana Cloud’s Explore view and query DCGM_FI_DEV_GPU_UTIL — data points should appear within 30 seconds.
Send inference requests to generate vLLM metrics, then query vllm:num_requests_running in Explore to verify application metrics arrive alongside the GPU hardware metrics.
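If you prefer to script these checks, a small parser for the Prometheus text exposition format is enough. A minimal sketch — the endpoint URLs and metric names match the agent config above; adjust them for your setup:

```python
import urllib.request


def parse_metric(exposition: str, name: str) -> list[float]:
    """Return sample values for `name` from Prometheus text exposition format.

    Sample lines look like: DCGM_FI_DEV_GPU_UTIL{gpu="0"} 87
    Comment lines start with '#'; the value is the last whitespace-separated token.
    (Prefix matching is deliberately loose — good enough for a smoke test.)
    """
    values = []
    for line in exposition.splitlines():
        if line.startswith(name):
            values.append(float(line.split()[-1]))
    return values


def check_endpoint(url: str, metric: str, timeout: float = 5.0) -> bool:
    """True if the endpoint currently serves at least one sample of `metric`."""
    body = urllib.request.urlopen(url, timeout=timeout).read().decode()
    return len(parse_metric(body, metric)) > 0


# On the GPU server itself:
#   check_endpoint("http://localhost:9400/metrics", "DCGM_FI_DEV_GPU_UTIL")
#   check_endpoint("http://localhost:8000/metrics", "vllm:num_requests_running")
```

Running both checks in a cron job or healthcheck gives you an early warning before gaps show up in Grafana.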
Production Tips
Create alert rules in Grafana Cloud for critical thresholds: GPU temperature above 85 °C, memory utilisation above 95%, inference latency p95 above your SLA target, and the DCGM exporter going offline. Route alerts to Slack, PagerDuty, or email based on severity.
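In Prometheus-style rule syntax, those thresholds might look like the following sketch (the DCGM_FI_* fields are standard DCGM metrics; the severity labels are placeholders to wire up your own routing):

```yaml
groups:
  - name: gpu_ai_alerts
    rules:
      - alert: GPUOverTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: critical
      - alert: GPUMemoryPressure
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 10m
        labels:
          severity: warning
      - alert: DCGMExporterDown
        expr: up{job="dcgm_exporter"} == 0
        for: 2m
        labels:
          severity: critical
```

The `for:` durations suppress flapping: a brief temperature spike during a burst of requests won't page anyone, but a sustained breach will.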
Use Grafana’s dashboard variables to build a single dashboard that works across multiple GPU servers. A server selector dropdown lets you switch between instances without duplicating panels. This scales well as you add more GPU servers to your fleet.
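Concretely, a `server` variable can be populated from the labels the agent config above already attaches to every series (query syntax for a Prometheus data source):

```promql
# Dashboard variable "server" (Query type):
label_values(DCGM_FI_DEV_GPU_UTIL, server)

# Panel queries then filter on the current selection:
DCGM_FI_DEV_GPU_UTIL{server=~"$server"}
```

Using the regex matcher (`=~`) rather than `=` keeps the panels working when "All" or multiple servers are selected.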
Grafana Cloud’s free tier is generous enough for small deployments. For larger fleets running open-source models, the paid tier adds features like SLO tracking and advanced alerting. Pair monitoring with our secure API guide, browse more tutorials, or get started with GigaGPU to build observable AI infrastructure.