ARTICLES · 2026-06-03 · BY EFFLOOW CONTENT FACTORY

OpenTelemetry GenAI: Trace LLM Agent Tool Calls

Trace LLM agent runs with OpenTelemetry GenAI spans using a local Python sandbox PoC with model and tool-call attributes.

opentelemetry genai observability llm-agents python tracing sandbox-poc

When an LLM agent fails, the hard question is rarely "did the model answer?" It is "where did the run go wrong?" The model call may be slow, a tool may have retried, the agent may have used the wrong retrieval result, or the final answer may have hidden a failed intermediate step. Plain logs can show pieces of that story, but they usually do not preserve the hierarchy.

OpenTelemetry's GenAI semantic conventions are becoming the common vocabulary for that hierarchy. The official OpenTelemetry GenAI observability walkthrough, published May 14, 2026, shows an agent trace with a top-level invoke_agent span, child chat spans, and execute_tool spans for tool calls. The same post points to the token-count attributes gen_ai.usage.input_tokens and gen_ai.usage.output_tokens, and to gen_ai.response.finish_reasons for finish reasons.

Effloow Lab ran a local sandbox PoC for this article. The lab installed OpenTelemetry Python packages, imported the Anthropic instrumentation package, and exported a four-span agent trace to JSON without API keys or live model calls. The evidence note is at data/lab-runs/opentelemetry-genai-llm-agent-tracing-sandbox-poc-2026.md.

Effloow Lab — Local sandbox on macOS 15.6 arm64 with Python 3.12.8, opentelemetry-sdk==1.42.1, opentelemetry-exporter-otlp==1.42.1, opentelemetry-instrumentation-anthropic==0.61.0, and no LLM API calls.

Why LLM Agent Tracing Needs a Standard

Traditional application traces already answer useful questions: which service called which dependency, how long the database query took, and where an exception appeared. Agent traces need those answers plus a few GenAI-specific details.

For an agent run, the important units are not only HTTP requests. They are model calls, tool calls, retrieval calls, handoffs, prompt events, completion events, token usage, and sometimes agent-to-agent delegation. If every framework invents its own names for those units, observability becomes vendor-specific. Traces emitted by a coding agent, a customer-support agent, and a workflow agent may all describe the same shape with incompatible fields.

The current OpenTelemetry GenAI convention gives teams a shared naming layer. The official semantic-convention docs define GenAI signals for events, exceptions, metrics, model spans, agent spans, and framework spans. The client-span docs describe a model inference span as a client call to a GenAI model or service, with required attributes such as gen_ai.operation.name and gen_ai.provider.name when available. The same docs define execute_tool as the operation name for tool execution spans and recommend gen_ai.tool.name plus gen_ai.tool.call.id when those values exist.

That standardization matters most when an agent is connected to production tools. A trace can show whether the agent called the model twice, whether a tool call was responsible for latency, and whether the model stopped because it requested a tool or because it finished normally. Without this structure, teams often debug agent failures by reading unstructured logs and hoping the right correlation ID survived.

Current Status: Useful, but Still Moving

This is not a "set it once and forget it" spec. As of the current OpenTelemetry docs reviewed on June 3, 2026, many GenAI semantic-convention fields are marked Development. The GenAI docs also describe a transition plan for instrumentation libraries, including OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental for libraries that can emit newer convention versions.

That has two practical consequences.

First, production systems should tolerate both older and newer attribute names during the transition. For example, many current examples and libraries still emit gen_ai.system, while newer convention text emphasizes gen_ai.provider.name. In the sandbox PoC, Effloow Lab wrote both attributes on the simulated Anthropic chat span:

{
  "gen_ai.system": "anthropic",
  "gen_ai.provider.name": "anthropic",
  "gen_ai.request.model": "claude-sonnet-4-20250514",
  "gen_ai.response.model": "claude-sonnet-4-20250514",
  "gen_ai.usage.input_tokens": 184,
  "gen_ai.usage.output_tokens": 47
}

Second, teams should avoid building fragile dashboards that depend on a single experimental field name. Use the convention where it exists, but keep the ingestion layer able to normalize aliases. This is especially important for GenAI backends that aggregate traces from multiple SDKs, model providers, and agent frameworks.
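A minimal sketch of alias normalization at the ingestion layer, assuming span attributes arrive as plain dictionaries. The function name and the ALIASES table are hypothetical; the only alias pair taken from this article is gen_ai.system and gen_ai.provider.name.

```python
# Alias table: legacy attribute name -> newer attribute name.
# Only this one pair comes from the article; extend it as conventions move.
ALIASES = {"gen_ai.system": "gen_ai.provider.name"}


def normalize_genai_attributes(attributes):
    """Return a copy in which each aliased pair is populated under both names."""
    normalized = dict(attributes)
    for legacy, current in ALIASES.items():
        if legacy in normalized and current not in normalized:
            normalized[current] = normalized[legacy]
        elif current in normalized and legacy not in normalized:
            normalized[legacy] = normalized[current]
    return normalized


# Spans from an older and a newer SDK end up with the same attribute set.
old_style = {"gen_ai.system": "anthropic", "gen_ai.usage.input_tokens": 184}
new_style = {"gen_ai.provider.name": "anthropic", "gen_ai.usage.input_tokens": 184}
assert normalize_genai_attributes(old_style) == normalize_genai_attributes(new_style)
```

Populating both names, rather than dropping the legacy one, is a design choice: dashboards written against either field keep working while the transition is in progress.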

What the Sandbox Proved

The sandbox created a temporary virtualenv under /tmp/effloow-otel-genai-poc and installed:

opentelemetry-sdk==1.42.1
opentelemetry-exporter-otlp==1.42.1
opentelemetry-instrumentation-anthropic==0.61.0
opentelemetry-semantic-conventions-ai==0.5.1

The first import attempt surfaced a missing-dependency issue: importing opentelemetry.instrumentation.anthropic failed until the Anthropic client and Pydantic were also installed. After adding anthropic==0.105.2 and pydantic==2.13.4, the instrumentation package imported successfully.

Then the PoC manually emitted an agent-shaped trace with a custom JSON exporter:

span_count 4
chat CLIENT 211221f62b7b1a6e [...]
execute_tool INTERNAL 211221f62b7b1a6e [...]
execute_tool INTERNAL 211221f62b7b1a6e [...]
invoke_agent INTERNAL None [...]

The span tree had one trace ID and one root:

{
  "span_count": 4,
  "span_names": ["chat", "execute_tool", "execute_tool", "invoke_agent"],
  "trace_id": "0f1035558bef566e0d26981c0031d202"
}
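The "one trace ID, one root" property can be checked mechanically once spans are exported. The sketch below assumes each exported span is a dictionary with name, trace_id, span_id, and parent_id keys; that schema is hypothetical, since the layout of the sandbox's custom JSON exporter is not documented here, and the IDs are synthetic placeholders rather than values from the lab run.

```python
def check_agent_trace(spans):
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []

    # All spans of one agent run should share a single trace ID.
    trace_ids = {s["trace_id"] for s in spans}
    if len(trace_ids) != 1:
        problems.append(f"expected 1 trace id, found {len(trace_ids)}")

    # Exactly one span should have no parent, and it should be invoke_agent.
    roots = [s for s in spans if s["parent_id"] is None]
    if len(roots) != 1:
        problems.append(f"expected 1 root span, found {len(roots)}")
    elif roots[0]["name"] != "invoke_agent":
        problems.append(f"root span is {roots[0]['name']!r}, not 'invoke_agent'")

    # Every non-root span should point at a span that exists in this export.
    known_ids = {s["span_id"] for s in spans}
    for s in spans:
        if s["parent_id"] is not None and s["parent_id"] not in known_ids:
            problems.append(f"{s['name']} has unknown parent {s['parent_id']}")
    return problems


# Synthetic four-span trace with the same shape as the sandbox output.
sample = [
    {"name": "chat", "trace_id": "t1", "span_id": "s2", "parent_id": "s1"},
    {"name": "execute_tool", "trace_id": "t1", "span_id": "s3", "parent_id": "s1"},
    {"name": "execute_tool", "trace_id": "t1", "span_id": "s4", "parent_id": "s1"},
    {"name": "invoke_agent", "trace_id": "t1", "span_id": "s1", "parent_id": None},
]
assert check_agent_trace(sample) == []
```

A check like this fits naturally in a unit test, so a refactor that accidentally detaches tool spans from the agent span fails before it reaches a backend.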

The root invoke_agent span had gen_ai.operation.name=invoke_agent and gen_ai.agent.name=local-research-assistant. The chat span had model, provider, token-count, and finish-reason attributes. The two execute_tool spans had gen_ai.operation.name=execute_tool, gen_ai.tool.name, and gen_ai.tool.call.id.

This proves the local instrumentation shape, not production correctness. No live Claude or OpenAI request was made. No provider token accounting was verified. No Jaeger UI screenshot was captured. The Docker attempt to run jaegertracing/all-in-one:latest stalled on credential lookup while pulling the image, so the lab stopped that path and kept the backend limitation explicit.

Reproduce the Local Trace Export

Create a throwaway sandbox:

rm -rf /tmp/effloow-otel-genai-poc
mkdir -p /tmp/effloow-otel-genai-poc
python3 -m venv /tmp/effloow-otel-genai-poc/.venv
/tmp/effloow-otel-genai-poc/.venv/bin/python -m pip install --upgrade pip

Install the packages:

/tmp/effloow-otel-genai-poc/.venv/bin/python -m pip install \
  opentelemetry-sdk==1.42.1 \
  opentelemetry-exporter-otlp==1.42.1 \
  opentelemetry-instrumentation-anthropic==0.61.0 \
  opentelemetry-semantic-conventions-ai==0.5.1 \
  anthropic==0.105.2 \
  pydantic==2.13.4

The important pattern is to initialize a TracerProvider, attach an exporter, then create nested spans. A simplified version:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace import SpanKind

# Register a provider that prints each finished span to stdout.
provider = TracerProvider(
    resource=Resource.create({"service.name": "agent-tracing-demo"})
)
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("demo.genai")

# Root span: one per agent run.
with tracer.start_as_current_span("invoke_agent", kind=SpanKind.INTERNAL) as root:
    root.set_attribute("gen_ai.operation.name", "invoke_agent")
    root.set_attribute("gen_ai.agent.name", "local-research-assistant")

    # Child span: the model call. Values are hard-coded; no provider is contacted.
    with tracer.start_as_current_span("chat", kind=SpanKind.CLIENT) as chat:
        chat.set_attribute("gen_ai.operation.name", "chat")
        chat.set_attribute("gen_ai.provider.name", "anthropic")
        chat.set_attribute("gen_ai.request.model", "claude-sonnet-4-20250514")
        chat.set_attribute("gen_ai.usage.input_tokens", 184)
        chat.set_attribute("gen_ai.usage.output_tokens", 47)

    # Child span: a tool execution, recorded as a sibling of the chat span.
    with tracer.start_as_current_span("execute_tool", kind=SpanKind.INTERNAL) as tool:
        tool.set_attribute("gen_ai.operation.name", "execute_tool")
        tool.set_attribute("gen_ai.tool.name", "search_docs")
        tool.set_attribute("gen_ai.tool.call.id", "toolu_001")

This is enough to validate the span hierarchy before wiring a real provider SDK. Once a real backend is available, swap the console or JSON exporter for OTLP and send traces to a collector or observability backend.

What to Instrument in a Real Agent

Start with the trace tree, not with dashboards. A useful production trace should let an engineer answer five questions quickly:

Which agent run is this?
Which model calls happened?
Which tools executed?
Which step consumed time, retries, or tokens?
Which sensitive content was intentionally not recorded?

For most teams, the first useful span layout looks like this:

invoke_agent
  chat claude-sonnet-4
  execute_tool search_docs
  chat claude-sonnet-4
  execute_tool create_ticket
  chat claude-sonnet-4

Use model spans for provider calls. Use tool spans for function tools, MCP tools, retrieval tools, database reads, file edits, or workflow actions. Add token counts when the provider returns them. Add finish reasons when the provider exposes them. Record exceptions on spans instead of burying them in logs.
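With that layout in place, several of the questions above can be answered from exported span data alone. The sketch below is illustrative, not a backend query: it assumes each span is a dictionary with name, duration_ms, and attributes keys (a hypothetical export schema), and the durations and second-call token counts are made-up sample values.

```python
def summarize_run(spans):
    """Summarize model calls, tools, token totals, and the slowest step of a run."""

    def operation(span):
        return span["attributes"].get("gen_ai.operation.name")

    model_calls = [s for s in spans if operation(s) == "chat"]
    tool_calls = [s for s in spans if operation(s) == "execute_tool"]

    # The root span encloses everything, so it is excluded from "slowest step".
    steps = model_calls + tool_calls
    slowest = max(steps, key=lambda s: s["duration_ms"], default=None)

    return {
        "model_calls": len(model_calls),
        "tools": [s["attributes"].get("gen_ai.tool.name") for s in tool_calls],
        "input_tokens": sum(
            s["attributes"].get("gen_ai.usage.input_tokens", 0) for s in model_calls
        ),
        "output_tokens": sum(
            s["attributes"].get("gen_ai.usage.output_tokens", 0) for s in model_calls
        ),
        "slowest_step": slowest["name"] if slowest else None,
    }


# Made-up sample run: two model calls around one tool call.
run = [
    {"name": "invoke_agent", "duration_ms": 3100,
     "attributes": {"gen_ai.operation.name": "invoke_agent"}},
    {"name": "chat", "duration_ms": 820,
     "attributes": {"gen_ai.operation.name": "chat",
                    "gen_ai.usage.input_tokens": 184,
                    "gen_ai.usage.output_tokens": 47}},
    {"name": "execute_tool", "duration_ms": 1430,
     "attributes": {"gen_ai.operation.name": "execute_tool",
                    "gen_ai.tool.name": "search_docs"}},
    {"name": "chat", "duration_ms": 640,
     "attributes": {"gen_ai.operation.name": "chat",
                    "gen_ai.usage.input_tokens": 230,
                    "gen_ai.usage.output_tokens": 61}},
]
summary = summarize_run(run)
print(summary)
```

In this sample the tool call, not either model call, is the slowest step, which is exactly the kind of finding a model-only trace would hide.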

Do not record full prompts, tool arguments, or tool results by default. The OpenTelemetry blog notes that content capture is opt-in because prompts and tool payloads may contain sensitive data. In the Effloow sandbox, prompt and payload content was intentionally represented only as content_recorded=false and payload_recorded=false event attributes.
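One way to keep that default enforceable is a small scrubbing step applied before attributes are written to a span. The sketch below is an assumption-heavy illustration: scrub_attributes and its capture_content flag are hypothetical, and the CONTENT_KEYS names should be checked against the semantic-convention version your SDK emits, since content-bearing attribute names have changed between versions. The content_recorded marker mirrors the one the sandbox used.

```python
# Attribute keys treated as content-bearing.
# ASSUMPTION: verify these names against the convention version in use.
CONTENT_KEYS = {
    "gen_ai.input.messages",
    "gen_ai.output.messages",
    "gen_ai.system_instructions",
    "gen_ai.tool.call.arguments",
    "gen_ai.tool.call.result",
}


def scrub_attributes(attributes, capture_content=False):
    """Drop content-bearing attributes unless capture was explicitly enabled."""
    if capture_content:
        kept = dict(attributes)
    else:
        kept = {k: v for k, v in attributes.items() if k not in CONTENT_KEYS}
    # Record the decision on the span so readers know content was withheld.
    kept["content_recorded"] = capture_content
    return kept


raw = {
    "gen_ai.tool.name": "search_docs",
    "gen_ai.tool.call.arguments": '{"query": "customer 4411 refund"}',
}
safe = scrub_attributes(raw)
assert "gen_ai.tool.call.arguments" not in safe
assert safe["content_recorded"] is False
```

Scrubbing in the application is a first line of defense only; a collector-side filter is still worth having for spans emitted by libraries you do not control.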

Collector and Backend Path

OpenTelemetry's Collector is the normal production bridge between instrumented services and backends. The official Collector docs describe it as a vendor-agnostic way to receive, process, and export telemetry data. The docs also note why a collector is useful beyond local development: retries, batching, encryption, and sensitive-data filtering can live in the collector instead of every application service.

For a GenAI agent service, a reasonable path is:

agent app
  -> OTLP exporter
  -> local or sidecar OpenTelemetry Collector
  -> processor pipeline for batching and redaction
  -> Jaeger, Tempo, Honeycomb, Datadog, New Relic, or another backend

The sandbox did not complete this backend path because the local Jaeger Docker pull blocked on credential lookup. That limitation matters. A JSON trace proves the span shape; a backend ingest test proves that the pipeline, collector config, and UI can preserve that shape. Treat those as separate checks.

Common Mistakes

The first mistake is tracing only the model call. A model-only trace can show latency and token usage, but it cannot explain whether a tool was slow, whether the agent loop repeated, or whether a retrieval step returned bad context.

The second mistake is recording too much content. Full prompts, tool arguments, and tool results are attractive during development and dangerous in production. If you enable content capture, pair it with retention limits, redaction, access control, and a clear reason.

The third mistake is pretending the conventions are fully stable. They are useful today, but teams should expect field-name movement. Normalize at ingestion and keep dashboards focused on a small set of durable fields: operation name, provider, requested model, response model, tool name, tool call ID, duration, error type, and token counts.
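As a rough illustration of "a small set of durable fields", the sketch below projects a span onto just those fields before it reaches a dashboard table. The span schema (attributes and duration_ms keys) and the dashboard_row helper are hypothetical; error.type is the general OpenTelemetry error attribute, assumed here to be what the instrumentation emits.

```python
# Fields a dashboard is allowed to depend on; everything else is ignored.
DURABLE_ATTRIBUTES = (
    "gen_ai.operation.name",
    "gen_ai.provider.name",
    "gen_ai.request.model",
    "gen_ai.response.model",
    "gen_ai.tool.name",
    "gen_ai.tool.call.id",
    "error.type",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
)


def dashboard_row(span):
    """Project a span onto the durable fields; missing values become None."""
    attributes = span.get("attributes", {})
    row = {key: attributes.get(key) for key in DURABLE_ATTRIBUTES}
    row["duration_ms"] = span.get("duration_ms")
    return row


tool_span = {
    "duration_ms": 1430,
    "attributes": {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": "search_docs",
        "gen_ai.tool.call.id": "toolu_001",
        "vendor.experimental.flag": True,  # ignored by the projection
    },
}
row = dashboard_row(tool_span)
assert row["gen_ai.tool.name"] == "search_docs"
assert "vendor.experimental.flag" not in row
```

Because every row has the same keys, a renamed or newly added experimental attribute changes nothing downstream until someone deliberately adds it to the tuple.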

The fourth mistake is treating observability as safety. A trace can show what happened. It does not approve tool use, block prompt injection, enforce data policy, or validate outputs. For agent safety, combine tracing with guardrails, tool approval, scoped credentials, and runtime policy checks. Effloow's OpenAI Agents SDK guardrails PoC covers a separate local pattern for tripwire testing.

FAQ

Q: Is OpenTelemetry GenAI ready for production LLM agents?

It is ready enough to pilot for traces, metrics, and events, but the GenAI semantic conventions are still in Development status in the current docs. Use them, but normalize changing attributes and avoid assuming every SDK emits the same field set.

Q: Do I need Jaeger to use OpenTelemetry for LLM tracing?

No. Jaeger is one possible backend. OpenTelemetry emits telemetry through SDKs and exporters, commonly through OTLP. You can send traces to an OpenTelemetry Collector and then to any compatible backend. The Effloow sandbox used a JSON exporter because the local Jaeger Docker image pull did not complete.

Q: Should I record prompts and tool results in spans?

Default to no. Record model names, operation names, tool names, token counts, durations, finish reasons, and errors first. Full prompts and tool payloads may contain secrets or customer data, so they should be opt-in and governed.

Q: What is the minimum useful agent trace?

One root run span, model-call spans, and tool-call spans. If you can see invoke_agent -> chat -> execute_tool -> chat, you can already debug more than a flat log stream.

Key Takeaways

OpenTelemetry GenAI tracing is useful because it makes an agent run inspectable as a hierarchy. The model call, tool calls, token usage, finish reasons, and errors can live in one trace instead of scattered logs.

The Effloow Lab PoC proved a narrow but practical point: a local Python app can emit an agent-shaped OpenTelemetry trace with GenAI-style attributes and no API key. It did not prove live Anthropic/OpenAI auto-instrumentation, Jaeger rendering, provider token accounting, or production collector behavior.

For production, start small: emit the span tree, keep content capture off by default, normalize convention changes, route through a collector when the service becomes real, and treat tracing as observability rather than policy enforcement.

Bottom Line

OpenTelemetry GenAI is the right direction for agent observability, but the responsible rollout is incremental: prove the trace shape locally, keep sensitive payloads out, then validate backend ingest before depending on it during incidents.
