Production LLM serving generates high-volume logs. Unstructured text lines are painful to search and aggregate; structured JSON logs make everything from alerting to debugging tractable. On our dedicated GPU hosting this is the logging default we recommend.
What to Log
Per request:
- Request ID (trace-correlatable)
- User or API key (hashed)
- Model name and version
- Input token count and output token count
- Time-to-first-token
- Total latency
- Status (success, truncated, error)
- Temperature, top_p, max_tokens
Do not log prompts or completions by default: they are storage-heavy and privacy-sensitive. Enable debug logging only per customer, on an opt-in basis.
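Hashing the API key keeps log records correlatable per caller without storing the raw credential. A minimal sketch using only the standard library (the `hash_api_key` helper name is illustrative, not from any particular framework):

```python
import hashlib

def hash_api_key(api_key: str) -> str:
    """Return a stable, non-reversible identifier for a caller.

    A plain SHA-256 is enough for log correlation: the goal is grouping
    requests by key, not password storage. Illustrative helper.
    """
    digest = hashlib.sha256(api_key.encode("utf-8")).hexdigest()
    # Prefix the algorithm so the logged value is self-describing,
    # matching the "sha256:..." style in the format example.
    return f"sha256:{digest}"
```

The same key always hashes to the same value, so you can still count requests per customer without ever writing the key itself to disk.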
Format
{
"timestamp": "2029-08-15T10:23:45.123Z",
"level": "INFO",
"event": "inference_complete",
"request_id": "req_xyz",
"api_key_hash": "sha256:abc",
"model": "llama-3.3-70b",
"input_tokens": 247,
"output_tokens": 112,
"ttft_ms": 380,
"total_ms": 4120,
"status": "success"
}
Every field is a known key. No free-text messages that require regex parsing.
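Because every line is a self-contained JSON object, any consumer can parse and validate records with nothing but the standard library. A quick sketch (field names follow the example above; the `REQUIRED` set is an illustrative choice, not a standard):

```python
import json

# Keys we expect in every record; adjust to your own schema.
REQUIRED = {"timestamp", "level", "event", "request_id", "model", "status"}

def parse_log_line(line: str) -> dict:
    """Parse one JSON log line and check the known keys are present."""
    record = json.loads(line)
    missing = REQUIRED - record.keys()
    if missing:
        raise ValueError(f"log record missing keys: {sorted(missing)}")
    return record

line = ('{"timestamp": "2029-08-15T10:23:45.123Z", "level": "INFO", '
        '"event": "inference_complete", "request_id": "req_xyz", '
        '"model": "llama-3.3-70b", "status": "success"}')
record = parse_log_line(line)
```

No regexes, no grok patterns: a malformed record fails loudly at `json.loads` or at the key check instead of silently matching nothing.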
Implementation
For a FastAPI wrapper around vLLM:
import time

import structlog
from fastapi import FastAPI

app = FastAPI()
log = structlog.get_logger()

@app.middleware("http")
async def log_requests(request, call_next):
    start = time.time()
    response = await call_next(request)
    log.info(
        "inference_complete",
        # request_id is assumed to be set by an upstream middleware
        request_id=request.state.request_id,
        method=request.method,
        path=request.url.path,
        status=response.status_code,
        duration_ms=int((time.time() - start) * 1000),
    )
    return response
For vLLM’s built-in logging, redirect stdout through a JSON formatter or use Python’s logging config with python-json-logger.
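If you would rather not add python-json-logger as a dependency, a minimal stdlib `logging.Formatter` that emits the same one-object-per-line shape is only a few lines. A sketch (the `fields` extra-key convention is an assumption of this example, not a vLLM or stdlib standard):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Merge structured fields passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger("vllm").addHandler(handler)
```

Usage: `logging.getLogger("vllm").info("inference_complete", extra={"fields": {"model": "llama-3.3-70b"}})` produces a single JSON line with known keys.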
Aggregation
Ship JSON logs to Loki, Elasticsearch, or an S3-compatible bucket via Vector or Fluent Bit. Grafana or Kibana queries then let you slice by model, by customer, by error type.
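Even before the logs reach Loki or Elasticsearch, the same JSON lines can be sliced locally with a few lines of Python, e.g. per-model p95 latency (a sketch; field names match the format above):

```python
import json
from collections import defaultdict
from statistics import quantiles

def p95_by_model(lines):
    """Group total latency by model and return the p95 for each."""
    by_model = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        if record.get("status") == "success":
            by_model[record["model"]].append(record["total_ms"])
    # quantiles(n=20) yields 19 cut points; index 18 is the 95th percentile.
    # Require at least two samples, as quantiles() needs them.
    return {model: quantiles(vals, n=20)[18]
            for model, vals in by_model.items() if len(vals) >= 2}
```

The same slicing (by model, customer, or error type) is what the Grafana or Kibana queries do at scale once the logs are shipped.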
Observable LLM Hosting
Structured logging and Grafana preconfigured on UK dedicated GPU servers.
Browse GPU Servers. See also DCGM Exporter and AI audit trail logging.