LLM Tracing 101: How to Debug Your AI Application in Production
Learn how to implement LLM tracing to debug agents, optimize performance, and reduce costs. Complete guide with code examples for production AI systems.
Key Takeaways
- Traditional logging fails for non-deterministic LLM applications with multi-step workflows
- LLM tracing captures the complete execution path as hierarchical spans with attributes
- Traces help identify performance bottlenecks, hallucinations, cost spikes, and infinite loops
- Three implementation approaches: manual instrumentation, framework auto-instrumentation, or observability platforms
- Start by tracing critical paths, then expand coverage as you see value
You deploy your LLM application. Users start reporting issues: "The chatbot gave me a weird answer." "It took forever to respond." "The costs are way higher than expected."
You open your logs and find... print statements. Timestamps. Maybe some error messages. But nothing that tells you what actually happened inside your AI agent's reasoning process.
Welcome to the world of LLM debugging, where traditional tools fall short and your "it worked in testing" confidence crumbles in production.
Why LLMs Are Hard to Debug
If you've built traditional backend systems, you know the debugging playbook: check the logs, trace the request, inspect the database state. With LLMs, that playbook breaks down immediately.
Non-deterministic outputs mean the same input can produce different results. You can't just replay a request and expect the same behavior. Temperature settings, model updates, and sampling randomness all introduce variability that makes reproducibility challenging.
Multi-step agent workflows compound this complexity. A single user query might trigger:
- Initial LLM call to plan the approach
- Three tool invocations to fetch data
- A second LLM call to synthesize results
- A final formatting step
If the output is wrong, which step failed? Traditional logs show you the start and end, but the branching logic in between remains invisible.
Invisible token consumption means you don't know where your costs are coming from. Your billing dashboard shows 10 million tokens used yesterday, but which prompts consumed them? Was it the verbose system instructions? The debugging context you forgot to remove? The agent that looped 47 times before giving up?
And then there's the classic "it worked in testing" problem. Your evaluation set passes. Your integration tests are green. But in production, edge cases emerge: user queries you never anticipated, data formats that break your prompts, rate limits that cause cascading failures.
Traditional logging, with its sequential text output, can't capture the tree-like execution of an agent making decisions, spawning parallel tasks, and maintaining context across steps. You need something more structured.
What is LLM Tracing?
LLM tracing captures the full execution path of an AI request as a structured, hierarchical record. Think of it as a flight recorder for your AI application - every decision, every API call, every token consumed, timestamped and organized.
Traces vs Logs vs Metrics
Understanding the difference between observability primitives is crucial for debugging LLM applications effectively:
| Type | Purpose | Example | Best For |
|---|---|---|---|
| Logs | Discrete events | "User sent message", "API call completed" | Point-in-time debugging |
| Metrics | Aggregated numbers | "Average latency: 2.3s", "Total tokens: 10M" | Trend analysis, alerting |
| Traces | Complete request journey | Full execution path with timing | Root cause analysis |
A trace consists of spans - units of work with start/end times, attributes, and parent-child relationships. For an LLM application:
Trace: User asks "What's the weather in Paris?"
├─ Span: Process user query (parent span)
│ ├─ Span: LLM call - intent classification
│ │ └─ Attributes: model=gpt-4, tokens=45, latency=320ms
│ ├─ Span: Tool call - weather API
│ │ └─ Attributes: tool=get_weather, location=Paris, latency=180ms
│ └─ Span: LLM call - format response
│ └─ Attributes: model=gpt-4, tokens=67, latency=290ms
This hierarchical structure lets you see not just what happened, but when it happened, in what order, and with what data.
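Before wiring up any instrumentation, it helps to see a span as a plain data structure. Here is a minimal sketch in Python; the field names mirror the attributes above, and it is an illustration rather than any particular library's schema.
from dataclasses import dataclass, field
from typing import Any, Dict, Optional
import time
import uuid

@dataclass
class Span:
    name: str
    trace_id: str
    parent_id: Optional[str] = None  # None for the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    start_time: float = field(default_factory=time.time)
    end_time: Optional[float] = None
    attributes: Dict[str, Any] = field(default_factory=dict)

    def end(self, **attributes) -> None:
        # Close the span and attach whatever was measured (tokens, latency, model)
        self.end_time = time.time()
        self.attributes.update(attributes)

# The parent_id is what turns flat spans into the tree shown above
root = Span(name="process_user_query", trace_id=uuid.uuid4().hex)
child = Span(name="llm_intent_classification", trace_id=root.trace_id, parent_id=root.span_id)
child.end(model="gpt-4", tokens=45, latency_ms=320)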
Tracing Simple LLM Calls
Let's start with the basics: tracing a single OpenAI API call. Here's what a naive implementation might look like:
import openai
import time
# Before: no visibility
response = openai.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "Explain quantum computing"}]
)
Now let's add tracing. At minimum, you want to capture:
- Prompt (input messages)
- Response (completion text)
- Latency (how long it took)
- Tokens (input + output counts)
- Model parameters (model name, temperature, max_tokens)
import openai
import time
import json
def traced_llm_call(messages, model="gpt-4", temperature=0.7):
start_time = time.time()
trace = {
"timestamp": start_time,
"type": "llm_call",
"model": model,
"temperature": temperature,
"input": messages
}
try:
response = openai.chat.completions.create(
model=model,
messages=messages,
temperature=temperature
)
trace["output"] = response.choices[0].message.content
trace["tokens_input"] = response.usage.prompt_tokens
trace["tokens_output"] = response.usage.completion_tokens
trace["tokens_total"] = response.usage.total_tokens
trace["latency_ms"] = (time.time() - start_time) * 1000
trace["status"] = "success"
# Save or send this trace to your observability system
save_trace(trace)
return response
except Exception as e:
trace["status"] = "error"
trace["error"] = str(e)
trace["latency_ms"] = (time.time() - start_time) * 1000
save_trace(trace)
raise
# Usage
response = traced_llm_call(
messages=[{"role": "user", "content": "Explain quantum computing"}],
model="gpt-4",
temperature=0.7
)
This gives you the raw data you need to debug issues. When a user complains about a response, you can look up the trace and see exactly what prompt was sent and what came back.
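The save_trace helper above is deliberately left abstract. As a minimal sketch, you could append each trace as one JSON line to a local file; in production you would send it to a database or an observability backend instead.
import json

TRACE_FILE = "traces.jsonl"  # hypothetical local destination

def save_trace(trace: dict) -> None:
    # One trace per line keeps the file easy to tail, grep, and load later
    with open(TRACE_FILE, "a") as f:
        f.write(json.dumps(trace, default=str) + "\n")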
Tracing Multi-Step Agents
Real-world applications rarely involve a single LLM call. Agents combine reasoning, tool use, and iteration. Here's where tracing becomes essential.
Consider a LangChain agent that can search the web and perform calculations:
from langchain.agents import initialize_agent, Tool
from langchain.llms import OpenAI
import time
# Without tracing, this is a black box
def ask_agent(question):
tools = [
Tool(name="Search", func=search_web, description="Search the web for current information"),
Tool(name="Calculator", func=calculate, description="Evaluate math expressions")
]
agent = initialize_agent(tools, OpenAI(temperature=0), verbose=True)
return agent.run(question)
# With tracing, we see each step
class TracedAgent:
def __init__(self):
self.trace_id = generate_trace_id()
self.spans = []
def create_span(self, name, parent_id=None):
span_id = generate_span_id()
span = {
"trace_id": self.trace_id,
"span_id": span_id,
"parent_id": parent_id,
"name": name,
"start_time": time.time(),
"attributes": {}
}
return span_id, span
def end_span(self, span_id, span, **attributes):
span["end_time"] = time.time()
span["duration_ms"] = (span["end_time"] - span["start_time"]) * 1000
span["attributes"].update(attributes)
self.spans.append(span)
save_span(span)
def run(self, question):
# Root span
root_span_id, root_span = self.create_span("agent_execution")
try:
# Planning step
plan_span_id, plan_span = self.create_span("llm_planning", root_span_id)
plan = self._plan_action(question)
self.end_span(plan_span_id, plan_span,
step="planning",
model="gpt-4",
tokens=plan.tokens)
# Tool execution (could be multiple)
step_num = 1
while not plan.is_complete:
tool_span_id, tool_span = self.create_span(
f"tool_execution_{step_num}",
root_span_id
)
result = self._execute_tool(plan.tool_name, plan.tool_input)
self.end_span(tool_span_id, tool_span,
step=step_num,
tool=plan.tool_name,
input=plan.tool_input,
output=result)
# Reasoning after tool use
reason_span_id, reason_span = self.create_span(
f"llm_reasoning_{step_num}",
root_span_id
)
plan = self._reason_next_step(result)
self.end_span(reason_span_id, reason_span,
step=step_num,
model="gpt-4",
tokens=plan.tokens)
step_num += 1
# Final response
final_span_id, final_span = self.create_span("final_response", root_span_id)
response = self._generate_response(plan.final_answer)
self.end_span(final_span_id, final_span,
model="gpt-4",
tokens=response.tokens)
self.end_span(root_span_id, root_span,
status="success",
total_steps=step_num)
return response.text
except Exception as e:
self.end_span(root_span_id, root_span,
status="error",
error=str(e))
raise
Now when the agent executes, you get a complete picture:
Trace ID: abc123
├─ Span: agent_execution (2.4s)
│ ├─ Span: llm_planning (340ms)
│ │ └─ Attributes: model=gpt-4, tokens=128
│ ├─ Span: tool_execution_1 (120ms)
│ │ └─ Attributes: tool=Search, input="Python tutorials", output="..."
│ ├─ Span: llm_reasoning_1 (280ms)
│ │ └─ Attributes: model=gpt-4, tokens=156
│ ├─ Span: tool_execution_2 (95ms)
│ │ └─ Attributes: tool=Calculator, input="2+2", output="4"
│ ├─ Span: llm_reasoning_2 (310ms)
│ │ └─ Attributes: model=gpt-4, tokens=142
│ └─ Span: final_response (290ms)
│ └─ Attributes: model=gpt-4, tokens=89
This structure makes it obvious if a tool is slow, if the agent is using too many steps, or if one particular LLM call is consuming excessive tokens.
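The agent example leans on a few helpers that are not shown. A minimal sketch, assuming UUID-based identifiers and the same append-only JSONL storage as the save_trace sketch earlier:
import json
import uuid

def generate_trace_id() -> str:
    return uuid.uuid4().hex

def generate_span_id() -> str:
    return uuid.uuid4().hex

def save_span(span: dict) -> None:
    # Spans from all traces land in one file; trace_id ties them back together
    with open("spans.jsonl", "a") as f:
        f.write(json.dumps(span, default=str) + "\n")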
Tracing RAG Pipelines
Retrieval-Augmented Generation (RAG) adds another layer of complexity: you're debugging both retrieval quality and generation quality. A trace helps you isolate which component is failing. The pipeline below assumes the same create_span and end_span helpers defined on TracedAgent above.
from typing import List, Dict
import time
class TracedRAGPipeline:
def __init__(self, vector_db, llm):
self.vector_db = vector_db
self.llm = llm
self.trace_id = generate_trace_id()
def query(self, question: str, top_k: int = 5) -> str:
root_span_id, root_span = self.create_span("rag_query")
# 1. Query embedding
embed_span_id, embed_span = self.create_span("embed_query", root_span_id)
query_embedding = self._embed(question)
self.end_span(embed_span_id, embed_span,
input_length=len(question),
vector_dimensions=len(query_embedding))
# 2. Vector search
search_span_id, search_span = self.create_span("vector_search", root_span_id)
results = self.vector_db.search(query_embedding, top_k=top_k)
self.end_span(search_span_id, search_span,
top_k=top_k,
results_returned=len(results),
similarity_scores=[r.score for r in results])
# 3. Reranking (optional but common)
rerank_span_id, rerank_span = self.create_span("rerank", root_span_id)
reranked = self._rerank(question, results)
self.end_span(rerank_span_id, rerank_span,
input_count=len(results),
output_count=len(reranked))
# 4. Context assembly
context_span_id, context_span = self.create_span("build_context", root_span_id)
context = self._build_context(reranked)
self.end_span(context_span_id, context_span,
context_length=len(context),
num_chunks=len(reranked))
# 5. LLM generation
gen_span_id, gen_span = self.create_span("llm_generation", root_span_id)
prompt = self._build_prompt(question, context)
response = self.llm.generate(prompt)
self.end_span(gen_span_id, gen_span,
model=self.llm.model_name,
prompt_tokens=response.usage.prompt_tokens,
completion_tokens=response.usage.completion_tokens,
context_included=context[:500]) # truncated for storage
self.end_span(root_span_id, root_span, status="success")
return response.text
This trace structure lets you answer questions like (see the sketch after this list):
- "Was the right content retrieved?" (check similarity scores in vector_search span)
- "Is reranking helping?" (compare before/after in rerank span)
- "Is the context too long?" (check context_length in build_context span)
- "Where's the latency bottleneck?" (compare span durations)
Common Debugging Scenarios
Let's walk through four real-world debugging scenarios where traces save the day.
Scenario 1: "Why is this so slow?"
A user reports your RAG chatbot takes 8 seconds to respond. You check the trace:
Trace: User query "What is your refund policy?"
├─ embed_query: 45ms
├─ vector_search: 120ms
├─ rerank: 4,200ms ← BOTTLENECK
├─ build_context: 15ms
└─ llm_generation: 890ms
The reranker is taking 4.2 seconds! You investigate and discover it's a cross-encoder model running on CPU. You switch to a faster model or move it to GPU, reducing the time to 200ms.
Without tracing, you might have optimized the LLM call (the most obvious suspect) and made no meaningful improvement.
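A trace viewer surfaces this at a glance, but even the plain JSONL store from earlier is enough to rank a trace's spans by duration; a small sketch under those assumptions:
import json

def slowest_spans(trace_id: str, path: str = "spans.jsonl", top_n: int = 3) -> list:
    # Sort a single trace's spans by duration to spot the bottleneck
    with open(path) as f:
        spans = [json.loads(line) for line in f]
    spans = [s for s in spans if s["trace_id"] == trace_id and "duration_ms" in s]
    return sorted(spans, key=lambda s: s["duration_ms"], reverse=True)[:top_n]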
Scenario 2: "Why did it hallucinate?"
Your Q&A bot confidently states a fact that's completely wrong. You pull up the trace:
Span: vector_search
Attributes:
top_k: 5
results_returned: 5
similarity_scores: [0.42, 0.38, 0.35, 0.33, 0.31]
Span: build_context
Attributes:
context_length: 1200
context_included: "... unrelated content about shipping policies ..."The similarity scores are low (< 0.5), and the context is about shipping, not the question asked. The retrieval failed, so the LLM had to guess. You realize your embedding model doesn't understand domain-specific terminology and needs fine-tuning.
Scenario 3: "Why did costs spike?"
Your monthly bill jumps from $500 to $3,200. You query your traces for high token usage:
SELECT trace_id, SUM(tokens_total) as total_tokens
FROM spans
WHERE timestamp > '2024-01-01'
GROUP BY trace_id
ORDER BY total_tokens DESC
LIMIT 10
You discover traces with 15,000+ tokens each. Looking at one:
Span: llm_generation
Attributes:
prompt_tokens: 12,800
completion_tokens: 450
context_included: "... [50 full documents] ..."Your context assembly is including entire documents instead of relevant chunks. A quick fix to limit context to 3,000 tokens cuts costs by 70%.
Scenario 4: "Why did the agent loop infinitely?"
Your agent sometimes runs for minutes and times out. You trace a failing execution:
Trace: User query "Calculate the ROI of this investment"
├─ Step 1: llm_planning → tool=Calculator
├─ Step 2: tool_execution → error: "division by zero"
├─ Step 3: llm_reasoning → tool=Calculator (same input!)
├─ Step 4: tool_execution → error: "division by zero"
├─ Step 5: llm_reasoning → tool=Calculator (same input!)
...
├─ Step 47: llm_reasoning → tool=Calculator (same input!)
└─ TIMEOUT
The agent can't handle the tool error and keeps retrying with the same input. You add error handling to the agent prompt: "If a tool returns an error, try a different approach or ask the user for clarification."
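The trace data also lets you catch this pattern before a user reports it. A sketch that flags traces where the agent repeated the same tool call with the same input, using the tool and input attributes recorded in the agent example:
import json
from collections import Counter

def detect_tool_loops(path: str = "spans.jsonl", max_repeats: int = 3) -> set:
    """Flag trace IDs where one (tool, input) pair repeats more than max_repeats times."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            span = json.loads(line)
            attrs = span.get("attributes", {})
            if "tool" in attrs:
                counts[(span["trace_id"], attrs["tool"], str(attrs.get("input")))] += 1
    return {trace_id for (trace_id, _, _), n in counts.items() if n > max_repeats}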
Implementing Tracing: Three Approaches
Now that you understand what tracing captures and why it matters, how do you actually implement it?
Approach 1: Manual Instrumentation
You write the tracing code yourself, as shown in the examples above.
Pros:
- Full control over what's captured
- No external dependencies
- Works with any stack
Cons:
- Tedious and error-prone
- Easy to forget to trace something
- No visualization tools included
When to use: Learning, simple applications, or when you need custom attributes that no library supports.
Approach 2: Framework Auto-Instrumentation
LangChain, LlamaIndex, and other frameworks provide built-in tracing:
# LangChain example (the handler name here is illustrative; the exact tracer/callback
# class and import path depend on your LangChain version and tracing backend)
from langchain.callbacks import TraceCallbackHandler
tracer = TraceCallbackHandler(
project_name="my-agent",
tags=["production"]
)
agent = initialize_agent(
tools=tools,
llm=llm,
callbacks=[tracer]
)
# Now every step is automatically traced
response = agent.run("What's the weather in Paris?")Pros:
- Easy setup (a few lines)
- Comprehensive coverage of framework operations
- Community support
Cons:
- Framework lock-in
- May capture too much or too little
- Limited customization
When to use: You're already using one of these frameworks and want quick results.
Approach 3: Observability Platform SDK
Services like OpenTelemetry, LangSmith, or specialized LLM observability platforms provide SDKs:
# Illustrative SDK: the package name and API are placeholders for your platform's client
from llm_observability import trace, init
# One-time initialization
init(api_key="your-key", project="my-app")
# Decorator-based tracing
@trace(name="rag_query")
def query_knowledge_base(question: str) -> str:
# Your code here - automatically traced
embedding = embed(question)
results = search(embedding)
return generate_response(results)
# Works with any code, any framework
response = query_knowledge_base("What is your refund policy?")
Pros:
- Works with any code or framework
- Includes UI for visualization and analysis
- Production-ready (sampling, retention, alerting)
- Minimal code changes
Cons:
- Vendor dependency
- May send data outside your infrastructure
- Cost for high-volume applications
When to use: Production applications where you need reliability, team collaboration, and long-term retention.
Tracing Best Practices
As you implement LLM tracing, follow these guidelines to maximize value and minimize overhead.
Essential Trace Attributes Checklist
| Category | What to Capture | Why It Matters |
|---|---|---|
| Identification | Request ID, timestamp, user/session ID | Track related requests, reproduce issues |
| Model Config | Model name, temperature, max_tokens, stop sequences | Understand behavior variations |
| Inputs | Prompt text (or hash if sensitive) | Debug hallucinations, verify context |
| Outputs | Response text (or hash if sensitive) | Validate quality, catch regressions |
| Resources | Token counts (input, output, total) | Cost attribution, optimization |
| Performance | Latency (wall-clock time), span durations | Identify bottlenecks |
| Status | Success, error, timeout, error messages | Reliability monitoring |
What NOT to Capture
Privacy and Performance Warnings
- Full documents or proprietary data (use summaries or content hashes)
- PII (personally identifiable information) unless you have explicit consent
- API keys or credentials in attributes
- Massive context windows verbatim (truncate or sample to first/last N tokens; see the sketch below)
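A minimal sketch of applying these rules before values are attached to a span: hash anything sensitive, truncate anything huge. The helper name and limits are illustrative.
import hashlib

def safe_attribute(value: str, sensitive: bool = False, max_chars: int = 500) -> str:
    """Hash sensitive values; keep only the head and tail of very long ones."""
    if sensitive:
        return "sha256:" + hashlib.sha256(value.encode()).hexdigest()
    if len(value) > max_chars:
        half = max_chars // 2
        return value[:half] + " ...[truncated]... " + value[-half:]
    return value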
Sampling strategies: When you're processing millions of requests, storing every trace becomes expensive. Implement sampling:
import random
def should_trace(request) -> bool:
# Always trace errors
if request.has_error:
return True
# Always trace slow requests
if request.latency > 5000: # 5 seconds
return True
# Sample 10% of normal requests
return random.random() < 0.10
Retention policies: Define how long to keep traces. A common approach (a cleanup sketch follows this list):
- Errors: 90 days
- Slow requests (p99): 30 days
- Normal requests: 7 days
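As a sketch of enforcing those tiers over the JSONL store from earlier, assuming each trace record carries the timestamp, status, and latency_ms fields captured by traced_llm_call:
import json
import time

RETENTION_DAYS = {"error": 90, "slow": 30, "normal": 7}

def keep_trace(trace: dict, now: float = None) -> bool:
    now = now or time.time()
    age_days = (now - trace["timestamp"]) / 86400
    if trace.get("status") == "error":
        tier = "error"
    elif trace.get("latency_ms", 0) > 5000:  # matches the "slow" threshold used in sampling
        tier = "slow"
    else:
        tier = "normal"
    return age_days <= RETENTION_DAYS[tier]

def sweep(path: str = "traces.jsonl") -> None:
    # Rewrite the store with only the traces still inside their retention window
    with open(path) as f:
        traces = [json.loads(line) for line in f]
    with open(path, "w") as f:
        f.writelines(json.dumps(t, default=str) + "\n" for t in traces if keep_trace(t))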
Hands-On: Add Tracing to a Simple App
Let's take a simple Q&A bot and add comprehensive tracing in about 20 lines of code.
Before (blind debugging):
def answer_question(question: str) -> str:
print(f"Received question: {question}")
# Retrieve context
context = knowledge_base.search(question)
print(f"Found {len(context)} results")
# Generate answer
prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
response = openai.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}]
)
answer = response.choices[0].message.content
print(f"Generated answer: {answer[:100]}...")
return answer
When something goes wrong, you squint at print statements and guess.
After (full visibility):
# Illustrative SDK: the package name, decorator, and span API vary by observability platform
from llm_observability import trace, span
@trace(name="answer_question")
def answer_question(question: str) -> str:
trace.set_attribute("question_length", len(question))
# Retrieve context
with span("knowledge_base_search"):
context = knowledge_base.search(question)
span.set_attribute("results_count", len(context))
span.set_attribute("avg_similarity", sum(r.score for r in context) / len(context))
# Generate answer
with span("llm_generation") as llm_span:
prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
llm_span.set_attribute("prompt_length", len(prompt))
response = openai.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}]
)
llm_span.set_attribute("model", "gpt-4")
llm_span.set_attribute("prompt_tokens", response.usage.prompt_tokens)
llm_span.set_attribute("completion_tokens", response.usage.completion_tokens)
answer = response.choices[0].message.content
trace.set_attribute("answer_length", len(answer))
return answer
Now every execution produces a structured trace you can query, filter, and visualize. You can answer questions like:
- "Show me all requests where similarity was below 0.5"
- "What's the p95 latency for the llm_generation span?"
- "Which questions produce the longest prompts?"
Getting Started with LLM Tracing
Tracing transforms LLM debugging from guesswork to science. Start small and expand coverage as you see value:
Your First Week Implementation Plan
Day 1-2: Pick one critical path
└─ Identify your main query handler or most-used agent workflow
Day 3-4: Add manual instrumentation
└─ Capture: prompts, responses, tokens, latency, model config
Day 5: Visualize traces
└─ Log to structured format (JSON) and view in trace viewer
Day 6-7: Expand coverage
└─ Add tracing to top 3 workflows based on traffic/importance
Week 2+: Production-grade setup
└─ Consider observability platform for team collaboration
Three Implementation Approaches Compared
| Approach | Setup Time | Flexibility | Best For |
|---|---|---|---|
| Manual Instrumentation | 2-4 hours | High | Learning, simple apps, custom attributes |
| Framework Auto-Instrumentation | 30 minutes | Medium | LangChain/LlamaIndex users, quick wins |
| Observability Platform | 1 hour | High | Production apps, team collaboration, long-term retention |
Conclusion
The difference between debugging with and without LLM tracing is night and day. With traces, you can confidently answer "what happened?" instead of guessing. You'll find performance bottlenecks, catch quality regressions, and optimize costs with actual data.
Your future self, debugging a production incident at 2 AM, will thank you for implementing tracing today.
Related Articles
- Complete Guide to LLM Observability - Broader context on monitoring AI systems
- Running Agents in Production - Advanced agent debugging techniques
- Cut LLM Costs by 40% - Cost optimization strategies using trace data
Ready to see your traces in action? Start with our 5-minute quick start guide to add comprehensive tracing to your LLM application with just a few lines of code.