You will set up the ELK stack (Elasticsearch, Logstash, Kibana) to capture, index, and visualise logs from your AI inference pipeline. By the end, you will have structured logging on your GPU server that lets you debug inference failures, track request patterns, and monitor model behaviour across your entire stack.
## Logging Architecture
| Component | Role | Port |
|---|---|---|
| Application | Structured JSON logs | — |
| Filebeat | Log collection and shipping | — |
| Logstash | Parse, transform, enrich | 5044 |
| Elasticsearch | Index and search | 9200 |
| Kibana | Dashboards and exploration | 5601 |
## Structured Logging in Python
Emit structured JSON logs from your inference server so that the ELK stack can parse fields automatically.
```python
import json
import logging
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    """Render each log record as a single JSON line for the ELK pipeline."""

    def format(self, record):
        log_data = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "inference-api",
            "message": record.getMessage(),
        }
        # Optional fields attached via the `extra` argument
        for field in ("request_id", "model", "tokens", "latency_ms"):
            if hasattr(record, field):
                log_data[field] = getattr(record, field)
        return json.dumps(log_data)

logger = logging.getLogger("inference")
handler = logging.FileHandler("/var/log/inference/api.json")
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Log inference requests
def log_inference(request_id, model, prompt_tokens, completion_tokens, latency):
    logger.info(
        "Inference completed",
        extra={
            "request_id": request_id,
            "model": model,
            "tokens": prompt_tokens + completion_tokens,
            "latency_ms": round(latency * 1000, 2),
        },
    )
```
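Before wiring up Filebeat, it is worth confirming that fields passed via `extra` actually land in the JSON output. A quick in-memory check, using a trimmed-down copy of the formatter and a throwaway `StringIO` stream instead of the log file (the request ID and token count are illustrative):

```python
import io
import json
import logging

class MiniJSONFormatter(logging.Formatter):
    """Trimmed-down copy of the JSONFormatter above, for a quick check."""

    def format(self, record):
        data = {"level": record.levelname, "message": record.getMessage()}
        for field in ("request_id", "model", "tokens", "latency_ms"):
            if hasattr(record, field):
                data[field] = getattr(record, field)
        return json.dumps(data)

# Write to an in-memory buffer instead of /var/log/inference/api.json
buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(MiniJSONFormatter())
check_logger = logging.getLogger("inference-check")
check_logger.addHandler(handler)
check_logger.setLevel(logging.INFO)

check_logger.info("Inference completed", extra={"request_id": "abc123", "tokens": 768})

parsed = json.loads(buf.getvalue())
print(parsed["request_id"], parsed["tokens"])  # abc123 768
```

Each record comes out as one JSON object per line, which is exactly the shape Filebeat and Logstash expect downstream.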
For the inference server itself, see the FastAPI server guide or the Flask API guide.
## ELK Stack Setup
Deploy the stack with Docker Compose on your GPU server or a separate monitoring server.
```yaml
# docker-compose.yml
version: "3.8"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
    ports:
      - "9200:9200"
    volumes:
      - es_data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:8.12.0
    ports:
      - "5044:5044"
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.0
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    depends_on:
      - elasticsearch

volumes:
  es_data:
```
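Once the stack is up, confirm each service responds before pointing Filebeat at it. A minimal health-check sketch; the URLs assume the default ports from the compose file on localhost, so adjust the host if the stack runs on a separate monitoring server:

```python
import urllib.error
import urllib.request

def check_service(url, timeout=3):
    """Return True if the endpoint answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Health endpoints exposed by the compose file above
services = {
    "elasticsearch": "http://localhost:9200/_cluster/health",
    "kibana": "http://localhost:5601/api/status",
}

for name, url in services.items():
    print(f"{name}: {'up' if check_service(url) else 'down'}")
```

Kibana in particular can take a minute or two to report healthy after Elasticsearch comes up, so rerun the check rather than assuming a failure on first boot.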
## Logstash Pipeline
Configure Logstash to parse JSON logs and enrich them with GPU server metadata.
```conf
# logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  # Parse the JSON line shipped by Filebeat into the [inference] namespace
  json {
    source => "message"
    target => "inference"
  }

  if [inference][latency_ms] {
    mutate {
      convert => {
        "[inference][latency_ms]" => "float"
        "[inference][tokens]" => "integer"
      }
    }
  }

  mutate {
    add_field => { "gpu_server" => "%{[host][name]}" }
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "inference-logs-%{+YYYY.MM.dd}"
  }
}
```
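The output stanza writes each event to a daily index. For ad-hoc queries outside Kibana, the matching index name and a query body can be built by hand. A sketch, where the field names follow the `[inference]` target set in the pipeline above and the one-second threshold is an arbitrary example:

```python
import json
from datetime import datetime, timezone

# Today's index, matching Logstash's "inference-logs-%{+YYYY.MM.dd}" pattern
index = f"inference-logs-{datetime.now(timezone.utc):%Y.%m.%d}"

# Elasticsearch query DSL: the 20 slowest requests over one second
slow_requests = {
    "query": {"range": {"inference.latency_ms": {"gt": 1000}}},
    "sort": [{"inference.latency_ms": {"order": "desc"}}],
    "size": 20,
}

print(index)
print(json.dumps(slow_requests))
```

POST this body to `http://elasticsearch:9200/<index>/_search` with curl or an HTTP client to pull the slowest requests without opening Kibana.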
## Filebeat Configuration
```yaml
# filebeat.yml
filebeat.inputs:
  # Ship raw JSON lines as-is; Logstash's json filter does the parsing,
  # so Filebeat-side json decoding is not needed here.
  - type: log
    enabled: true
    paths:
      - /var/log/inference/*.json
  - type: log
    enabled: true
    paths:
      - /var/log/vllm/*.log
    fields:
      service: vllm

output.logstash:
  hosts: ["logstash-server:5044"]
```
## Kibana Dashboards
Build dashboards in Kibana for inference observability. Key visualisations include:
- Request volume over time — line chart of requests per minute.
- Latency distribution — histogram of `inference.latency_ms` to spot slow requests.
- Error rate — percentage of failed inference requests by model and error type.
- Token throughput — sum of `inference.tokens` per time period.
- Model usage breakdown — pie chart by `inference.model` field.
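The error-rate and throughput panels boil down to simple aggregations over the structured fields. A self-contained sketch of the same arithmetic over a few hand-written sample lines in the shape the formatter emits (the model names and the ERROR entry are illustrative):

```python
import json

# Sample log lines shaped like the JSONFormatter output; values are made up
lines = [
    '{"level": "INFO", "model": "llama-3-8b", "tokens": 512, "latency_ms": 840.5}',
    '{"level": "INFO", "model": "llama-3-8b", "tokens": 256, "latency_ms": 310.2}',
    '{"level": "ERROR", "model": "mistral-7b", "message": "CUDA out of memory"}',
]

records = [json.loads(line) for line in lines]
errors = sum(1 for r in records if r["level"] == "ERROR")
error_rate = 100 * errors / len(records)
token_total = sum(r.get("tokens", 0) for r in records)

print(f"error rate: {error_rate:.1f}%")    # error rate: 33.3%
print(f"token throughput: {token_total}")  # token throughput: 768
```

Kibana runs the equivalent aggregations server-side in Elasticsearch, so the dashboards stay fast even over millions of log events.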
For metrics-based monitoring alongside logs, see the Prometheus and Grafana guide. For webhook notifications on errors, check the webhook integration guide. The self-hosting guide covers infrastructure, and our tutorials section has more observability patterns. Set up the backend with the vLLM production guide. For container logging in Kubernetes, see the GPU pod guide.