
Structured Logging for LLM Inference

Text logs are hard to search. Structured JSON logs let you aggregate, alert, and debug efficiently on a dedicated GPU serving production traffic.

Production LLM serving generates high-volume logs. Unstructured text lines are a pain to search and aggregate, while structured JSON logs make everything from alerting to debugging tractable. On our dedicated GPU hosting, this is the logging setup we recommend by default.


What to Log

Per request:

  • Request ID (trace-correlatable)
  • User or API key (hashed)
  • Model name and version
  • Input token count and output token count
  • Time-to-first-token
  • Total latency
  • Status (success, truncated, error)
  • Temperature, top_p, max_tokens

Do not log prompts or completions by default: they are expensive to store and privacy-sensitive. Enable debug logging per customer, on an opt-in basis, when they need it.
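The api_key_hash field from the list above can be derived once per request. A minimal sketch (the 12-character truncation is a choice for log readability, not a requirement):

```python
import hashlib

def hash_api_key(api_key: str) -> str:
    """Return a stable, non-reversible identifier for log correlation."""
    digest = hashlib.sha256(api_key.encode("utf-8")).hexdigest()
    return f"sha256:{digest[:12]}"
```

The same key always hashes to the same value, so you can group a customer's requests without ever storing the key itself.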

Format

{
  "timestamp": "2029-08-15T10:23:45.123Z",
  "level": "INFO",
  "event": "inference_complete",
  "request_id": "req_xyz",
  "api_key_hash": "sha256:abc",
  "model": "llama-3.3-70b",
  "input_tokens": 247,
  "output_tokens": 112,
  "ttft_ms": 380,
  "total_ms": 4120,
  "status": "success"
}

Every field is a known key. No free-text messages that require regex parsing.

Implementation

For a FastAPI wrapper around vLLM:

import time

import structlog

# Render events as single-line JSON, matching the format above.
structlog.configure(processors=[structlog.processors.JSONRenderer()])
log = structlog.get_logger()

@app.middleware("http")
async def log_requests(request, call_next):
    start = time.time()
    response = await call_next(request)
    log.info(
        "inference_complete",
        # Assumes an upstream middleware has assigned request.state.request_id.
        request_id=getattr(request.state, "request_id", None),
        method=request.method,
        path=request.url.path,
        status=response.status_code,
        duration_ms=int((time.time() - start) * 1000),
    )
    return response

For vLLM’s built-in logging, redirect stdout through a JSON formatter or use Python’s logging config with python-json-logger.
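If you prefer to avoid the extra dependency, a few lines of stdlib achieve the same effect. A minimal sketch, with field names chosen here to match the format above:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line with known keys."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger("vllm").addHandler(handler)
logging.getLogger("vllm").setLevel(logging.INFO)
```

Attaching the handler to the "vllm" logger converts the engine's text lines into JSON without touching vLLM itself.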

Aggregation

Ship JSON logs to Loki, Elasticsearch, or an S3-compatible bucket via Vector or Fluent Bit. Grafana or Kibana queries then let you slice by model, by customer, by error type.
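As a rough illustration of the Loki route (the file path, endpoint, and label are placeholders; check the Vector documentation for your version), a Vector pipeline for these JSON lines might look like:

```toml
[sources.inference_logs]
type = "file"
include = ["/var/log/inference/*.jsonl"]

[transforms.parse]
type = "remap"
inputs = ["inference_logs"]
source = '. = parse_json!(string!(.message))'

[sinks.loki]
type = "loki"
inputs = ["parse"]
endpoint = "http://loki:3100"
encoding.codec = "json"
labels.model = "{{ model }}"
```

Indexing on a low-cardinality label like model keeps Loki fast; leave high-cardinality fields such as request_id in the log body.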

Observable LLM Hosting

Structured logging and Grafana preconfigured on UK dedicated GPU servers.

Browse GPU Servers

See DCGM Exporter and AI audit trail logging.

