For production AI inference, observability is non-negotiable. Metrics tell you how the system is performing, logs let you diagnose specific issues, traces show request flow across services, and evals tell you whether output quality is holding. The four pillars need different tools.
Stack: DCGM Exporter + Prometheus + Grafana for hardware/serving metrics, Vector + Loki / Postgres / ClickHouse for structured logs, OpenTelemetry + Jaeger / Honeycomb for distributed traces, RAGAS / custom eval harness + dashboards for quality. ~1 day to set up; transformative for production operations.
Layers
- Metrics: time-series of system state (GPU temp, queue depth, p99 TTFT, throughput). Real-time alerting.
- Logs: structured events per request (user, prompt hash, response hash, tokens, cost). Forensic / audit / cost analysis. (Example record sketched after this list.)
- Traces: request flow across services (gateway → embeddings → LLM → reranker). Performance debugging.
- Evals: ongoing quality measurement. Regression detection.
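
To make the logs layer concrete, here is a minimal Python sketch of one structured event per request. The field names (`prompt_hash`, `completion_tokens`, `cost_usd`) and the hashing choice are illustrative, not a fixed schema; the only requirement is one JSON object per line so Vector / Fluent Bit can ship it downstream.

```python
import hashlib
import json
import logging
import time

logger = logging.getLogger("inference")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_request(request_id: str, user: str, prompt: str, response: str,
                prompt_tokens: int, completion_tokens: int, cost_usd: float) -> None:
    """Emit one structured JSON event per request; a log shipper
    (Vector / Fluent Bit) forwards these lines to Loki / ClickHouse / Postgres."""
    event = {
        "ts": time.time(),
        "request_id": request_id,
        "user": user,
        # Hash rather than store raw text: keeps logs audit- and dedup-friendly
        # without retaining user content.
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest(),
        "response_hash": hashlib.sha256(response.encode()).hexdigest(),
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost_usd": round(cost_usd, 6),
    }
    logger.info(json.dumps(event))

# Example:
# log_request("req-123", "alice", "Summarize...", "Here is...", 812, 204, 0.0031)
```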
Recommended stack
- Metrics: DCGM Exporter (GPU) + vLLM `/metrics` + node_exporter (host) → Prometheus → Grafana (instrumentation sketch after this list)
- Alerting: Alertmanager → Slack / PagerDuty
- Logs: Vector / Fluent Bit ships JSON logs → Loki (hot 30 days) + ClickHouse (analytics) + Postgres (audit)
- Traces: OpenTelemetry SDK in app → OTel Collector → Jaeger / Honeycomb (SDK setup sketched below)
- Evals: RAGAS / custom harness in CI + scheduled production-shadow runs → metrics back to Prometheus (harness sketch below)
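
DCGM Exporter and vLLM's `/metrics` endpoint are scraped as-is; for gateway-level serving metrics (TTFT, queue depth) you can expose one more endpoint with `prometheus_client`. A sketch under assumed metric names and latency buckets, not the names vLLM itself exports:

```python
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Time-to-first-token in seconds; buckets chosen for interactive-latency SLOs.
TTFT = Histogram("gateway_ttft_seconds", "Time to first token",
                 buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0))
QUEUE_DEPTH = Gauge("gateway_queue_depth", "Requests waiting for a decode slot")
REQUESTS = Counter("gateway_requests_total", "Completed requests", ["status"])

def record_request(ttft_seconds: float, queued: int, status: str) -> None:
    """Call once per completed request from the gateway's request handler."""
    TTFT.observe(ttft_seconds)
    QUEUE_DEPTH.set(queued)
    REQUESTS.labels(status=status).inc()

if __name__ == "__main__":
    # Prometheus scrapes http://<gateway>:9100/metrics alongside the DCGM
    # and vLLM targets; in a real gateway the web server keeps the process alive.
    start_http_server(9100)
    while True:
        time.sleep(60)
```

Add the new port as one more scrape job next to the DCGM, vLLM, and node_exporter targets.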
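For the traces layer, the OpenTelemetry SDK is configured once per service and exports spans over OTLP to the Collector, which fans out to Jaeger or Honeycomb. A minimal Python sketch; the service name and Collector endpoint are assumptions for illustration:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# One provider per process; the Collector endpoint below is illustrative.
provider = TracerProvider(
    resource=Resource.create({"service.name": "inference-gateway"})
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def handle_request(prompt: str) -> str:
    # Each stage becomes a child span, so the gateway → embeddings → LLM → reranker
    # flow shows up as a single trace in Jaeger / Honeycomb.
    with tracer.start_as_current_span("gateway.handle"):
        with tracer.start_as_current_span("embed"):
            pass  # call embedding service
        with tracer.start_as_current_span("llm.generate"):
            pass  # call vLLM
        with tracer.start_as_current_span("rerank"):
            pass  # call reranker
    return "response"
```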
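For the evals layer, a scheduled harness scores a sample of production-shadow outputs and writes the result back to Prometheus, here via the Pushgateway. The exact-match scorer below is a stand-in for RAGAS-style faithfulness / relevance metrics; the Pushgateway address, job name, and metric name are assumptions:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def exact_match(pred: str, ref: str) -> float:
    """Toy scorer; swap in RAGAS or another judge for real quality metrics."""
    return float(pred.strip().lower() == ref.strip().lower())

def run_eval(samples: list[tuple[str, str]]) -> float:
    # samples: (model_output, reference_answer) pairs from a shadow run.
    scores = [exact_match(pred, ref) for pred, ref in samples]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    accuracy = run_eval([("Paris", "paris"), ("42", "41")])
    registry = CollectorRegistry()
    gauge = Gauge("eval_exact_match_ratio",
                  "Share of eval samples with an exact match",
                  registry=registry)
    gauge.set(accuracy)
    # Prometheus scrapes the Pushgateway like any other target, so eval scores
    # land on the same dashboards and alerts as the serving metrics.
    push_to_gateway("localhost:9091", job="nightly_eval", registry=registry)
```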
Integration
Tie the layers together via `request_id`:
- Gateway generates `request_id` on every request (propagation sketch after this list)
- Propagates through OTel headers to all downstream services
- Logged in structured logs at every service
- One-click navigation from trace to logs to metrics in Grafana
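
A sketch of the gateway side, assuming Python, the TracerProvider configured above, and OpenTelemetry's default W3C propagators: the gateway mints a `request_id`, records it on the span, carries it in baggage, and injects the headers into the downstream call. The service URL and header carrier are illustrative:

```python
import uuid

import requests
from opentelemetry import baggage, trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("gateway")

def handle(user_prompt: str) -> None:
    request_id = str(uuid.uuid4())  # the gateway mints the id exactly once
    with tracer.start_as_current_span("gateway.handle") as span:
        span.set_attribute("request_id", request_id)
        # Carry request_id in baggage so every downstream service can read
        # it and include it in its own structured logs.
        ctx = baggage.set_baggage("request_id", request_id)
        headers: dict[str, str] = {}
        inject(headers, context=ctx)  # adds traceparent + baggage headers
        # Downstream call; the embeddings URL is illustrative.
        requests.post("http://embeddings:8080/embed",
                      json={"text": user_prompt}, headers=headers)
```

Downstream services log the same id alongside their own fields, which is what makes the trace → logs → metrics hop in Grafana a single click.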
Verdict
For production AI, this is the canonical observability stack in 2026. Setup is ~1 day; the value is decisive when incidents happen. Each layer addresses a different question; skipping any leaves blind spots.
Bottom line
The four-layer observability stack is the standard for production AI inference. See Prometheus setup.