
Monitoring an AI Inference Server: Prometheus, Grafana, and the Metrics That Matter

Production AI inference needs the same observability discipline as any other backend. Here are the metrics that actually predict outages, with Grafana dashboard recipes.

Most AI deployments monitor only the GPU temperature and call it observability. Real production needs application-level metrics that predict outages 5 minutes before they happen.

TL;DR

Three metric tiers: infrastructure (DCGM), inference engine (vLLM Prometheus), application (your structured logs). Alert on p99 TTFT, queue depth, cache hit rate, and GPU memory utilisation. Skip alerting on raw temperature unless you have a real cooling issue.

Metrics that matter

Infrastructure (DCGM exporter)

  • DCGM_FI_DEV_GPU_UTIL — GPU compute utilisation (%)
  • DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE — VRAM in use vs free
  • DCGM_FI_DEV_POWER_USAGE — power draw (W)
  • DCGM_FI_DEV_GPU_TEMP — GPU temperature (only worth alerting on above ~85°C)
  • DCGM_FI_DEV_THROTTLE_REASONS — non-zero = problem
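
A minimal Prometheus scrape config covering the DCGM exporter alongside vLLM might look like the sketch below. The job names and ports (9400 is dcgm-exporter's default, 8000 is vLLM's default API port) are assumptions, so adjust them to your deployment.

  # prometheus.yml — scrape config sketch; job names and ports are assumptions
  scrape_configs:
    - job_name: dcgm
      static_configs:
        - targets: ["localhost:9400"]   # dcgm-exporter default listen port
    - job_name: vllm
      metrics_path: /metrics            # vLLM serves Prometheus metrics here
      static_configs:
        - targets: ["localhost:8000"]   # vLLM OpenAI-compatible server default port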

Inference engine (vLLM with --enable-metrics)

  • vllm:num_requests_running — active sequences
  • vllm:num_requests_waiting — queue depth (alert >100)
  • vllm:gpu_cache_usage_perc — KV cache utilisation (alert >95%)
  • vllm:gpu_prefix_cache_hit_rate_perc — prefix cache hit rate
  • vllm:time_to_first_token_seconds — TTFT histogram
  • vllm:time_per_output_token_seconds — TPOT histogram
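
The two latency metrics are Prometheus histograms, so the p99 figures used throughout this guide come out of histogram_quantile. For example, p99 TTFT over a 5-minute window:

  # p99 time-to-first-token over the last 5 minutes
  histogram_quantile(0.99,
    sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))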

Application

  • Request latency p50/p95/p99
  • Token cost per request (output tokens)
  • Per-API-key request rate
  • Error rate by error class (4xx, 5xx, vLLM-specific)
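
These live in whatever gateway or proxy sits in front of vLLM. As a sketch, assuming that layer exports a histogram app_request_duration_seconds and a counter app_requests_total labelled by api_key and status (both metric names are hypothetical), the queries would be along these lines:

  # p95 request latency at the application layer
  histogram_quantile(0.95,
    sum(rate(app_request_duration_seconds_bucket[5m])) by (le))

  # per-API-key request rate
  sum(rate(app_requests_total[5m])) by (api_key)

  # 5xx error rate as a fraction of all requests
  sum(rate(app_requests_total{status=~"5.."}[5m]))
    / sum(rate(app_requests_total[5m]))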

Alerts that actually fire on real problems

  • p99 TTFT > 2s for 5 minutes — usually means queue depth blew out
  • vllm:num_requests_waiting > 100 for 2 minutes — incoming traffic exceeds capacity
  • GPU memory utilisation > 95% for 5 minutes — about to OOM
  • DCGM_FI_DEV_THROTTLE_REASONS != 0 — thermal or power throttling
  • 5xx error rate > 1% — vLLM crashes or driver issues
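
Expressed as Prometheus alerting rules, the first two look roughly like the following; treat the expressions and thresholds as a sketch rather than a drop-in rules file.

  groups:
    - name: inference
      rules:
        - alert: TTFTp99High
          # p99 time-to-first-token above 2 seconds for 5 minutes
          expr: >
            histogram_quantile(0.99,
              sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le)) > 2
          for: 5m
          labels:
            severity: page
        - alert: QueueDepthHigh
          # more than 100 requests waiting for 2 minutes
          expr: vllm:num_requests_waiting > 100
          for: 2m
          labels:
            severity: page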

Dashboard layout

Three rows on the main dashboard:

  1. Top row: TTFT p99, queue depth, GPU mem util, error rate (large numbers, alert thresholds visible)
  2. Middle row: throughput tok/s aggregate, requests/sec, prefix cache hit rate, GPU util (timeseries)
  3. Bottom row: VRAM by component, request latency by p50/p95/p99 (timeseries)
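
For panels not already covered by the queries above, GPU memory utilisation can be derived from the two DCGM framebuffer gauges, and aggregate throughput from vLLM's generation-token counter (vllm:generation_tokens_total is exposed by vLLM but not listed above, so verify the name against your own /metrics output):

  # GPU memory utilisation (%) from the DCGM framebuffer gauges
  100 * DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)

  # aggregate generation throughput in tokens/s
  sum(rate(vllm:generation_tokens_total[1m]))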

Verdict

Run vLLM with --enable-metrics, scrape with Prometheus, dashboard with Grafana, and alert on the conditions above. Most outages telegraph themselves 5 minutes ahead in queue depth.

Bottom line

Build observability before you ship traffic. The boring infrastructure pays for itself the first time an LLM deployment misbehaves at 3 AM. See our guide to building a production AI inference server.
