
Structured Logging for LLM Inference

Text logs are hard to search. Structured JSON logs let you aggregate, alert, and debug efficiently on a dedicated GPU serving production traffic.

Production LLM serving generates high-volume logs. Unstructured text lines are a pain to search and aggregate, while structured JSON logs make everything from alerting to debugging tractable. On our dedicated GPU hosting, this is the logging setup we recommend by default.


What to Log

Per request:

  • Request ID (trace-correlatable)
  • User or API key (hashed)
  • Model name and version
  • Input token count and output token count
  • Time-to-first-token
  • Total latency
  • Status (success, truncated, error)
  • Temperature, top_p, max_tokens

Do not log prompts or completions by default: they are expensive to store and privacy-sensitive. Enable debug logging per customer, on an opt-in basis, when they need it.
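The api_key_hash field from the list above can be derived once per request. A minimal sketch (the 12-character truncation is a choice for log readability, not a requirement):

```python
import hashlib

def hash_api_key(api_key: str) -> str:
    """Return a stable, non-reversible identifier for log correlation."""
    digest = hashlib.sha256(api_key.encode("utf-8")).hexdigest()
    return f"sha256:{digest[:12]}"
```

The same key always hashes to the same value, so you can group a customer's requests without ever storing the key itself.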

Format

{
  "timestamp": "2029-08-15T10:23:45.123Z",
  "level": "INFO",
  "event": "inference_complete",
  "request_id": "req_xyz",
  "api_key_hash": "sha256:abc",
  "model": "llama-3.3-70b",
  "input_tokens": 247,
  "output_tokens": 112,
  "ttft_ms": 380,
  "total_ms": 4120,
  "status": "success"
}

Every field is a known key. No free-text messages that require regex parsing.

Implementation

For a FastAPI wrapper around vLLM:

import time

import structlog

# Render events as single-line JSON, matching the format above.
structlog.configure(processors=[structlog.processors.JSONRenderer()])
log = structlog.get_logger()

@app.middleware("http")
async def log_requests(request, call_next):
    start = time.time()
    response = await call_next(request)
    log.info(
        "inference_complete",
        # Assumes an upstream middleware has assigned request.state.request_id.
        request_id=getattr(request.state, "request_id", None),
        method=request.method,
        path=request.url.path,
        status=response.status_code,
        duration_ms=int((time.time() - start) * 1000),
    )
    return response

For vLLM’s built-in logging, redirect stdout through a JSON formatter or use Python’s logging config with python-json-logger.
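If you prefer to avoid the extra dependency, a few lines of stdlib achieve the same effect. A minimal sketch, with field names chosen here to match the format above:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line with known keys."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger("vllm").addHandler(handler)
logging.getLogger("vllm").setLevel(logging.INFO)
```

Attaching the handler to the "vllm" logger converts the engine's text lines into JSON without touching vLLM itself.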

Aggregation

Ship JSON logs to Loki, Elasticsearch, or an S3-compatible bucket via Vector or Fluent Bit. Grafana or Kibana queries then let you slice by model, by customer, by error type.
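As a rough illustration of the Loki route (the file path, endpoint, and label are placeholders; check the Vector documentation for your version), a Vector pipeline for these JSON lines might look like:

```toml
[sources.inference_logs]
type = "file"
include = ["/var/log/inference/*.jsonl"]

[transforms.parse]
type = "remap"
inputs = ["inference_logs"]
source = '. = parse_json!(string!(.message))'

[sinks.loki]
type = "loki"
inputs = ["parse"]
endpoint = "http://loki:3100"
encoding.codec = "json"
labels.model = "{{ model }}"
```

Indexing on a low-cardinality label like model keeps Loki fast; leave high-cardinality fields such as request_id in the log body.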

Observable LLM Hosting

Structured logging and Grafana preconfigured on UK dedicated GPU servers.

Browse GPU Servers

See DCGM Exporter and AI audit trail logging.

