OpenTelemetry (OTel) is the de facto standard framework for distributed tracing. For AI applications with multiple service hops (gateway → embeddings → vector store → LLM → response), traces are essential for diagnosing latency issues. AI-specific span attributes capture model, tokens, and cost per hop.
Add OTel SDK to your AI app; instrument each service hop as a span; ship to Jaeger / Honeycomb / Grafana Tempo. AI-specific attributes: model, prompt_tokens, completion_tokens, cost_usd. One trace per request from gateway to response. Diagnoses latency / cost / failure root causes in seconds.
Why OTel
- Distributed traces: a slow request traverses gateway / vector / LLM — trace shows where time was spent
- Standard format: any compatible backend (Jaeger, Honeycomb, Datadog, Grafana Tempo)
- Vendor neutrality: switch backends without code changes
- AI-specific attributes: capture model / tokens / cost per span
- Sampling: trace 1-10% of requests; full coverage on errors
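The sampling bullet above maps to the SDK's sampler configuration. A minimal sketch: head-sample a fixed ratio of traces at the root; the 10% ratio is an example, and error-triggered full coverage is typically done with tail sampling in the OTel Collector rather than in the SDK.

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-sample ~10% of new traces at the root; child spans follow
# their parent's decision, so traces are never partially sampled.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))
```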
Setup
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Ship spans via OTLP to any compatible backend (Jaeger, Honeycomb, Tempo, ...).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# One trace per request; one child span per service hop.
# embed_query, vector_store, llm, count_tokens, etc. are app-specific stand-ins.
with tracer.start_as_current_span("rag.query") as span:
    span.set_attribute("user_id", user_id)
    with tracer.start_as_current_span("embed"):
        emb = embed_query(query)
    with tracer.start_as_current_span("retrieve") as r:
        r.set_attribute("k", 10)
        chunks = vector_store.search(emb, k=10)
    with tracer.start_as_current_span("llm.generate") as l:
        l.set_attribute("model", "llama-3.1-8b-fp8")
        l.set_attribute("prompt_tokens", count_tokens(prompt))
        response = llm.generate(prompt)
        l.set_attribute("completion_tokens", count_tokens(response))
```
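Failures matter as much as latency: a span that records the exception makes the failing hop obvious in the trace view. A hedged sketch of wrapping the llm.generate hop (traced_generate is a hypothetical helper; llm is an app-specific client):

```python
def traced_generate(tracer, llm, prompt, model):
    """Run one LLM call inside a span; record any exception before re-raising."""
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("model", model)
        try:
            return llm.generate(prompt)
        except Exception as exc:
            # record_exception attaches the exception as a span event,
            # so the trace backend can surface the failing hop directly.
            span.record_exception(exc)
            span.set_attribute("error", True)
            raise
```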
AI-specific spans
Per-span attributes for AI workloads:
- model: which model was called
- prompt_tokens, completion_tokens
- cost_usd or cost_gbp per call
- cache_hit: prefix or semantic
- fallback: was hosted-API fallback used?
- tenant_id, feature_id, request_id
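The attributes above are easiest to keep consistent behind one helper. A sketch, assuming a per-model price table (set_ai_attributes and PRICE_PER_1K are hypothetical; the rates are placeholders, not real pricing):

```python
# Placeholder per-1K-token rates in USD; substitute your provider's pricing.
PRICE_PER_1K = {"llama-3.1-8b-fp8": {"prompt": 0.0001, "completion": 0.0002}}

def set_ai_attributes(span, model, prompt_tokens, completion_tokens,
                      cache_hit=False, fallback=False):
    """Attach the standard AI attributes to a span, deriving cost from tokens."""
    rates = PRICE_PER_1K.get(model, {"prompt": 0.0, "completion": 0.0})
    cost_usd = (prompt_tokens / 1000) * rates["prompt"] \
             + (completion_tokens / 1000) * rates["completion"]
    span.set_attribute("model", model)
    span.set_attribute("prompt_tokens", prompt_tokens)
    span.set_attribute("completion_tokens", completion_tokens)
    span.set_attribute("cost_usd", round(cost_usd, 6))
    span.set_attribute("cache_hit", cache_hit)
    span.set_attribute("fallback", fallback)
```

One helper means every span carries the same attribute names, so backend queries like "cost_usd by tenant_id" work across all services.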
Verdict
For production AI applications with multi-service hops, OpenTelemetry tracing is essential: a standard format, AI-specific attributes, and flexible backends. Setup takes about half a day; the value during incident response and performance debugging is decisive. Build it in on day one of production deployment.
Bottom line
OTel plus AI-specific span attributes. See the obs stack.