Tutorials

AI Runtime Tracing with OpenTelemetry

OpenTelemetry instrumentation for AI applications — traces from gateway through embeddings, retrieval, LLM, response.

OpenTelemetry is the standard distributed tracing framework. For AI applications with multiple service hops (gateway → embeddings → vector store → LLM → response), traces are essential for diagnosing where latency, cost, and failures originate. AI-specific span attributes capture model, token counts, and cost per hop.

TL;DR

Add OTel SDK to your AI app; instrument each service hop as a span; ship to Jaeger / Honeycomb / Grafana Tempo. AI-specific attributes: model, prompt_tokens, completion_tokens, cost_usd. One trace per request from gateway to response. Diagnoses latency / cost / failure root causes in seconds.

Why OTel

  • Distributed traces: a slow request traverses gateway / vector store / LLM; the trace shows where the time was spent
  • Standard format: any compatible backend (Jaeger, Honeycomb, Datadog, Grafana Tempo)
  • Vendor neutrality: switch backends without code changes
  • AI-specific attributes: capture model / tokens / cost per span
  • Sampling: trace 1-10% of requests, with full coverage on errors (see the sketch after this list)
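
Head sampling is configured on the TracerProvider. A minimal sketch, assuming a 10% sample rate; names otherwise match the setup below:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Record 10% of new traces. ParentBased makes child spans follow the parent's
# decision, so a trace is never half-sampled across service hops.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))

Note that "full coverage on errors" means tail sampling: the SDK decides before a request's outcome is known, so keep-all-errors policies live in the OpenTelemetry Collector, not in application code.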

Setup

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Name the service so spans are attributed correctly in the backend.
provider = TracerProvider(resource=Resource.create({"service.name": "rag-api"}))
# OTLPSpanExporter defaults to localhost:4317 (gRPC); point it at your collector.
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# One root span per request; each hop (embed, retrieve, generate) nests inside it.
with tracer.start_as_current_span("rag.query") as span:
    span.set_attribute("user_id", user_id)

    with tracer.start_as_current_span("embed"):
        emb = embed_query(query)  # app-specific embedding helper

    with tracer.start_as_current_span("retrieve") as retrieve_span:
        retrieve_span.set_attribute("k", 10)
        chunks = vector_store.search(emb, k=10)

    # Assemble the prompt from the query and retrieved chunks (app-specific helper).
    prompt = build_prompt(query, chunks)

    with tracer.start_as_current_span("llm.generate") as llm_span:
        llm_span.set_attribute("model", "llama-3.1-8b-fp8")
        llm_span.set_attribute("prompt_tokens", count_tokens(prompt))
        response = llm.generate(prompt)
        llm_span.set_attribute("completion_tokens", count_tokens(response))

AI-specific spans

Per-span attributes for AI workloads (a sketch of attaching them follows this list):

  • model: which model was called
  • prompt_tokens, completion_tokens
  • cost_usd or cost_gbp per call
  • cache_hit: prefix or semantic
  • fallback: was hosted-API fallback used?
  • tenant_id, feature_id, request_id
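
A minimal sketch of attaching these to the llm.generate span from the setup example. The per-token prices are illustrative, and llm and count_tokens are the same app-specific helpers assumed above:

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

# Illustrative per-1K-token rates (prompt, completion) in USD; use your real costs.
PRICE_PER_1K = {"llama-3.1-8b-fp8": (0.0001, 0.0002)}

def traced_generate(prompt, model, tenant_id, request_id):
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("model", model)
        span.set_attribute("tenant_id", tenant_id)
        span.set_attribute("request_id", request_id)
        try:
            response = llm.generate(prompt)  # app-specific client, as above
        except Exception as exc:
            span.record_exception(exc)  # failure is recorded on the span
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
        pt, ct = count_tokens(prompt), count_tokens(response)
        in_rate, out_rate = PRICE_PER_1K[model]
        span.set_attribute("prompt_tokens", pt)
        span.set_attribute("completion_tokens", ct)
        span.set_attribute("cost_usd", pt / 1000 * in_rate + ct / 1000 * out_rate)
        return response

cache_hit and fallback are set the same way, at the point in the code where those decisions are made.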

Verdict

For production AI applications with multi-service hops, OpenTelemetry tracing is essential. Standard format + AI-specific attributes + flexible backends. Setup takes about half a day; the payoff during incident response and performance debugging is decisive. Build it in on day one of production deployment.

Bottom line

OTel + AI-specific attributes. See the observability stack guide.
