
Monitoring an AI Inference Server: Prometheus, Grafana, and the Metrics That Matter

Production AI inference needs the same observability discipline as any other backend. Here are the metrics that actually predict outages, with Grafana dashboard recipes.

Most AI deployments monitor only the GPU temperature and call it observability. Real production needs application-level metrics that predict outages 5 minutes before they happen.

TL;DR

Three metric tiers: infrastructure (DCGM), inference engine (vLLM Prometheus), application (your structured logs). Alert on p99 TTFT, queue depth, cache hit rate, and GPU memory utilisation. Skip alerting on raw temperature unless you have a real cooling issue.

Metrics that matter

Infrastructure (DCGM exporter)

  • DCGM_FI_DEV_GPU_UTIL — GPU compute utilisation (%)
  • DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE — VRAM in use vs free
  • DCGM_FI_DEV_POWER_USAGE — power draw (W)
  • DCGM_FI_DEV_GPU_TEMP — GPU temperature (only worth alerting on above ~85°C)
  • DCGM_FI_DEV_THROTTLE_REASONS — non-zero = problem
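
A minimal Prometheus scrape config covering the DCGM exporter alongside vLLM might look like the sketch below. The job names and ports (9400 is dcgm-exporter's default, 8000 is vLLM's default API port) are assumptions, so adjust them to your deployment.

  # prometheus.yml — scrape config sketch; job names and ports are assumptions
  scrape_configs:
    - job_name: dcgm
      static_configs:
        - targets: ["localhost:9400"]   # dcgm-exporter default listen port
    - job_name: vllm
      metrics_path: /metrics            # vLLM serves Prometheus metrics here
      static_configs:
        - targets: ["localhost:8000"]   # vLLM OpenAI-compatible server default port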

Inference engine (vLLM with --enable-metrics)

  • vllm:num_requests_running — active sequences
  • vllm:num_requests_waiting — queue depth (alert >100)
  • vllm:gpu_cache_usage_perc — KV cache utilisation (alert >95%)
  • vllm:gpu_prefix_cache_hit_rate_perc — prefix cache hit rate
  • vllm:time_to_first_token_seconds — TTFT histogram
  • vllm:time_per_output_token_seconds — TPOT histogram
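
The two latency metrics are Prometheus histograms, so the p99 figures used throughout this guide come out of histogram_quantile. For example, p99 TTFT over a 5-minute window:

  # p99 time-to-first-token over the last 5 minutes
  histogram_quantile(0.99,
    sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))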

Application

  • Request latency p50/p95/p99
  • Token cost per request (output tokens)
  • Per-API-key request rate
  • Error rate by error class (4xx, 5xx, vLLM-specific)
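
These live in whatever gateway or proxy sits in front of vLLM. As a sketch, assuming that layer exports a histogram app_request_duration_seconds and a counter app_requests_total labelled by api_key and status (both metric names are hypothetical), the queries would be along these lines:

  # p95 request latency at the application layer
  histogram_quantile(0.95,
    sum(rate(app_request_duration_seconds_bucket[5m])) by (le))

  # per-API-key request rate
  sum(rate(app_requests_total[5m])) by (api_key)

  # 5xx error rate as a fraction of all requests
  sum(rate(app_requests_total{status=~"5.."}[5m]))
    / sum(rate(app_requests_total[5m]))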

Alerts that actually fire on real problems

  • p99 TTFT > 2s for 5 minutes — usually means queue depth blew out
  • vllm:num_requests_waiting > 100 for 2 minutes — incoming traffic exceeds capacity
  • GPU memory utilisation > 95% for 5 minutes — about to OOM
  • DCGM_FI_DEV_THROTTLE_REASONS != 0 — thermal or power throttling
  • 5xx error rate > 1% — vLLM crashes or driver issues
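
Expressed as Prometheus alerting rules, the first two look roughly like the following; treat the expressions and thresholds as a sketch rather than a drop-in rules file.

  groups:
    - name: inference
      rules:
        - alert: TTFTp99High
          # p99 time-to-first-token above 2 seconds for 5 minutes
          expr: >
            histogram_quantile(0.99,
              sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le)) > 2
          for: 5m
          labels:
            severity: page
        - alert: QueueDepthHigh
          # more than 100 requests waiting for 2 minutes
          expr: vllm:num_requests_waiting > 100
          for: 2m
          labels:
            severity: page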

Dashboard layout

Three rows on the main dashboard:

  1. Top row: TTFT p99, queue depth, GPU mem util, error rate (large numbers, alert thresholds visible)
  2. Middle row: throughput tok/s aggregate, requests/sec, prefix cache hit rate, GPU util (timeseries)
  3. Bottom row: VRAM by component, request latency by p50/p95/p99 (timeseries)
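
For panels not already covered by the queries above, GPU memory utilisation can be derived from the two DCGM framebuffer gauges, and aggregate throughput from vLLM's generation-token counter (vllm:generation_tokens_total is exposed by vLLM but not listed above, so verify the name against your own /metrics output):

  # GPU memory utilisation (%) from the DCGM framebuffer gauges
  100 * DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)

  # aggregate generation throughput in tokens/s
  sum(rate(vllm:generation_tokens_total[1m]))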

Verdict

Run vLLM with --enable-metrics, scrape with Prometheus, dashboard with Grafana, and alert on the conditions above. Most outages telegraph themselves 5 minutes ahead in queue depth.

Bottom line

Build observability before you ship traffic. The boring infrastructure pays for itself the first time an LLM deployment misbehaves at 3 AM. See our guide to building a production AI inference server.
