2026-01-28

LLM Tracing 101: How to Debug Your AI Application in Production

Learn how to implement LLM tracing to debug agents, optimize performance, and reduce costs. Complete guide with code examples for production AI systems.

Key Takeaways

- Traditional logging fails for non-deterministic LLM applications with multi-step workflows

- LLM tracing captures the complete execution path as hierarchical spans with attributes

- Traces help identify performance bottlenecks, hallucinations, cost spikes, and infinite loops

- Three implementation approaches: manual instrumentation, framework auto-instrumentation, or observability platforms

- Start by tracing critical paths, then expand coverage as you see value

You deploy your LLM application. Users start reporting issues: "The chatbot gave me a weird answer." "It took forever to respond." "The costs are way higher than expected."

You open your logs and find... print statements. Timestamps. Maybe some error messages. But nothing that tells you what actually happened inside your AI agent's reasoning process.

Welcome to the world of LLM debugging, where traditional tools fall short and your "it worked in testing" confidence crumbles in production.

Why LLMs Are Hard to Debug

If you've built traditional backend systems, you know the debugging playbook: check the logs, trace the request, inspect the database state. With LLMs, that playbook breaks down immediately.

Non-deterministic outputs mean the same input can produce different results. You can't just replay a request and expect the same behavior. Temperature settings, model updates, and sampling randomness all introduce variability that makes reproducibility challenging.

Multi-step agent workflows compound this complexity. A single user query might trigger:

  • Initial LLM call to plan the approach
  • Three tool invocations to fetch data
  • A second LLM call to synthesize results
  • A final formatting step

If the output is wrong, which step failed? Traditional logs show you the start and end, but the branching logic in between remains invisible.

Invisible token consumption means you don't know where your costs are coming from. Your billing dashboard shows 10 million tokens used yesterday, but which prompts consumed them? Was it the verbose system instructions? The debugging context you forgot to remove? The agent that looped 47 times before giving up?

And then there's the classic "it worked in testing" problem. Your evaluation set passes. Your integration tests are green. But in production, edge cases emerge: user queries you never anticipated, data formats that break your prompts, rate limits that cause cascading failures.

Traditional logging, with its sequential text output, can't capture the tree-like execution of an agent making decisions, spawning parallel tasks, and maintaining context across steps. You need something more structured.

What is LLM Tracing?

LLM tracing captures the full execution path of an AI request as a structured, hierarchical record. Think of it as a flight recorder for your AI application - every decision, every API call, every token consumed, timestamped and organized.

Traces vs Logs vs Metrics

Understanding the difference between observability primitives is crucial for debugging LLM applications effectively:

| Type | Purpose | Example | Best For |
|------|---------|---------|----------|
| Logs | Discrete events | "User sent message", "API call completed" | Point-in-time debugging |
| Metrics | Aggregated numbers | "Average latency: 2.3s", "Total tokens: 10M" | Trend analysis, alerting |
| Traces | Complete request journey | Full execution path with timing | Root cause analysis |

A trace consists of spans - units of work with start/end times, attributes, and parent-child relationships. For an LLM application:

Trace: User asks "What's the weather in Paris?"
├─ Span: Process user query (parent span)
│  ├─ Span: LLM call - intent classification
│  │  └─ Attributes: model=gpt-4, tokens=45, latency=320ms
│  ├─ Span: Tool call - weather API
│  │  └─ Attributes: tool=get_weather, location=Paris, latency=180ms
│  └─ Span: LLM call - format response
│     └─ Attributes: model=gpt-4, tokens=67, latency=290ms

This hierarchical structure lets you see not just what happened, but when it happened, in what order, and with what data.
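
Concretely, a span can be represented as a plain dictionary; the hands-on examples later in this post use exactly this shape (the field names are a common convention, not a formal standard):

span = {
    "trace_id": "abc123",        # groups every span that belongs to one request
    "span_id": "span-001",       # unique ID for this unit of work
    "parent_id": None,           # None marks the root span
    "name": "llm_call - intent classification",
    "start_time": 1706400000.00, # epoch seconds
    "end_time": 1706400000.32,
    "attributes": {"model": "gpt-4", "tokens": 45, "latency_ms": 320},
}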

Tracing Simple LLM Calls

Let's start with the basics: tracing a single OpenAI API call. Here's what a naive implementation might look like:

import openai
import time

# Before: no visibility
response = openai.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

Now let's add tracing. At minimum, you want to capture:

  • Prompt (input messages)
  • Response (completion text)
  • Latency (how long it took)
  • Tokens (input + output counts)
  • Model parameters (model name, temperature, max_tokens)

import openai
import time
import json

def traced_llm_call(messages, model="gpt-4", temperature=0.7):
    start_time = time.time()

    trace = {
        "timestamp": start_time,
        "type": "llm_call",
        "model": model,
        "temperature": temperature,
        "input": messages
    }

    try:
        response = openai.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature
        )

        trace["output"] = response.choices[0].message.content
        trace["tokens_input"] = response.usage.prompt_tokens
        trace["tokens_output"] = response.usage.completion_tokens
        trace["tokens_total"] = response.usage.total_tokens
        trace["latency_ms"] = (time.time() - start_time) * 1000
        trace["status"] = "success"

        # Save or send this trace to your observability system
        save_trace(trace)

        return response
    except Exception as e:
        trace["status"] = "error"
        trace["error"] = str(e)
        trace["latency_ms"] = (time.time() - start_time) * 1000
        save_trace(trace)
        raise

# Usage
response = traced_llm_call(
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    model="gpt-4",
    temperature=0.7
)

This gives you the raw data you need to debug issues. When a user complains about a response, you can look up the trace and see exactly what prompt was sent and what came back.
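
The save_trace helper above is left undefined on purpose; where traces go is up to you. A minimal sketch, assuming append-only JSON Lines on local disk (any database or observability backend works just as well):

import json

TRACE_FILE = "traces.jsonl"  # hypothetical local destination

def save_trace(trace: dict) -> None:
    # Append one JSON object per line so traces are easy to grep and query later
    with open(TRACE_FILE, "a") as f:
        f.write(json.dumps(trace, default=str) + "\n")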

Tracing Multi-Step Agents

Real-world applications rarely involve a single LLM call. Agents combine reasoning, tool use, and iteration. Here's where tracing becomes essential.

Consider a LangChain agent that can search the web and perform calculations:

from langchain.agents import initialize_agent, Tool
from langchain.llms import OpenAI
import time

# Without tracing, this is a black box
def ask_agent(question):
    tools = [
        Tool(name="Search", func=search_web, description="Search the web for information"),
        Tool(name="Calculator", func=calculate, description="Evaluate math expressions")
    ]

    agent = initialize_agent(tools, OpenAI(temperature=0), verbose=True)
    return agent.run(question)

# With tracing, we see each step
class TracedAgent:
    def __init__(self):
        self.trace_id = generate_trace_id()
        self.spans = []

    def create_span(self, name, parent_id=None):
        span_id = generate_span_id()
        span = {
            "trace_id": self.trace_id,
            "span_id": span_id,
            "parent_id": parent_id,
            "name": name,
            "start_time": time.time(),
            "attributes": {}
        }
        return span_id, span

    def end_span(self, span_id, span, **attributes):
        span["end_time"] = time.time()
        span["duration_ms"] = (span["end_time"] - span["start_time"]) * 1000
        span["attributes"].update(attributes)
        self.spans.append(span)
        save_span(span)

    def run(self, question):
        # Root span
        root_span_id, root_span = self.create_span("agent_execution")

        try:
            # Planning step
            plan_span_id, plan_span = self.create_span("llm_planning", root_span_id)
            plan = self._plan_action(question)
            self.end_span(plan_span_id, plan_span,
                         step="planning",
                         model="gpt-4",
                         tokens=plan.tokens)

            # Tool execution (could be multiple)
            step_num = 1
            while not plan.is_complete:
                tool_span_id, tool_span = self.create_span(
                    f"tool_execution_{step_num}",
                    root_span_id
                )

                result = self._execute_tool(plan.tool_name, plan.tool_input)

                self.end_span(tool_span_id, tool_span,
                            step=step_num,
                            tool=plan.tool_name,
                            input=plan.tool_input,
                            output=result)

                # Reasoning after tool use
                reason_span_id, reason_span = self.create_span(
                    f"llm_reasoning_{step_num}",
                    root_span_id
                )
                plan = self._reason_next_step(result)
                self.end_span(reason_span_id, reason_span,
                            step=step_num,
                            model="gpt-4",
                            tokens=plan.tokens)

                step_num += 1

            # Final response
            final_span_id, final_span = self.create_span("final_response", root_span_id)
            response = self._generate_response(plan.final_answer)
            self.end_span(final_span_id, final_span,
                         model="gpt-4",
                         tokens=response.tokens)

            self.end_span(root_span_id, root_span,
                         status="success",
                         total_steps=step_num)

            return response.text

        except Exception as e:
            self.end_span(root_span_id, root_span,
                         status="error",
                         error=str(e))
            raise

Now when the agent executes, you get a complete picture:

Trace ID: abc123
├─ Span: agent_execution (2.4s)
│  ├─ Span: llm_planning (340ms)
│  │  └─ Attributes: model=gpt-4, tokens=128
│  ├─ Span: tool_execution_1 (120ms)
│  │  └─ Attributes: tool=Search, input="Python tutorials", output="..."
│  ├─ Span: llm_reasoning_1 (280ms)
│  │  └─ Attributes: model=gpt-4, tokens=156
│  ├─ Span: tool_execution_2 (95ms)
│  │  └─ Attributes: tool=Calculator, input="2+2", output="4"
│  ├─ Span: llm_reasoning_2 (310ms)
│  │  └─ Attributes: model=gpt-4, tokens=142
│  └─ Span: final_response (290ms)
│     └─ Attributes: model=gpt-4, tokens=89

This structure makes it obvious if a tool is slow, if the agent is using too many steps, or if one particular LLM call is consuming excessive tokens.
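
The helpers this class leans on (generate_trace_id, generate_span_id, save_span) are placeholders; here is one minimal way to fill them in, reusing the JSON Lines idea from the save_trace sketch earlier (the file name is an assumption):

import json
import uuid

def generate_trace_id() -> str:
    # Any globally unique string works; UUIDs are the simplest choice
    return uuid.uuid4().hex

def generate_span_id() -> str:
    return uuid.uuid4().hex[:16]

def save_span(span: dict) -> None:
    # Same append-only JSON Lines approach as save_trace above
    with open("spans.jsonl", "a") as f:
        f.write(json.dumps(span, default=str) + "\n")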

Tracing RAG Pipelines

Retrieval-Augmented Generation (RAG) adds another layer of complexity: you're debugging both retrieval quality and generation quality. A trace helps you isolate which component is failing.

from typing import List, Dict
import time

class TracedRAGPipeline:
    # Reuses the create_span / end_span helpers shown in TracedAgent above
    def __init__(self, vector_db, llm):
        self.vector_db = vector_db
        self.llm = llm
        self.trace_id = generate_trace_id()

    def query(self, question: str, top_k: int = 5) -> str:
        root_span_id, root_span = self.create_span("rag_query")

        # 1. Query embedding
        embed_span_id, embed_span = self.create_span("embed_query", root_span_id)
        query_embedding = self._embed(question)
        self.end_span(embed_span_id, embed_span,
                     input_length=len(question),
                     vector_dimensions=len(query_embedding))

        # 2. Vector search
        search_span_id, search_span = self.create_span("vector_search", root_span_id)
        results = self.vector_db.search(query_embedding, top_k=top_k)
        self.end_span(search_span_id, search_span,
                     top_k=top_k,
                     results_returned=len(results),
                     similarity_scores=[r.score for r in results])

        # 3. Reranking (optional but common)
        rerank_span_id, rerank_span = self.create_span("rerank", root_span_id)
        reranked = self._rerank(question, results)
        self.end_span(rerank_span_id, rerank_span,
                     input_count=len(results),
                     output_count=len(reranked))

        # 4. Context assembly
        context_span_id, context_span = self.create_span("build_context", root_span_id)
        context = self._build_context(reranked)
        self.end_span(context_span_id, context_span,
                     context_length=len(context),
                     num_chunks=len(reranked))

        # 5. LLM generation
        gen_span_id, gen_span = self.create_span("llm_generation", root_span_id)
        prompt = self._build_prompt(question, context)
        response = self.llm.generate(prompt)
        self.end_span(gen_span_id, gen_span,
                     model=self.llm.model_name,
                     prompt_tokens=response.usage.prompt_tokens,
                     completion_tokens=response.usage.completion_tokens,
                     context_included=context[:500])  # truncated for storage

        self.end_span(root_span_id, root_span, status="success")

        return response.text

This trace structure lets you answer questions like:

  • "Was the right content retrieved?" (check similarity scores in vector_search span)
  • "Is reranking helping?" (compare before/after in rerank span)
  • "Is the context too long?" (check context_length in build_context span)
  • "Where's the latency bottleneck?" (compare span durations)
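
Because spans carry attributes like similarity_scores, answering the first question above doesn't require a platform; a rough sketch against the JSON Lines store used in the earlier examples:

import json

def low_recall_traces(path="spans.jsonl", threshold=0.5):
    """Find vector_search spans whose best similarity score is below a threshold."""
    suspects = []
    with open(path) as f:
        for line in f:
            span = json.loads(line)
            scores = span.get("attributes", {}).get("similarity_scores")
            if span.get("name") == "vector_search" and scores and max(scores) < threshold:
                suspects.append(span["trace_id"])
    return suspects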

Common Debugging Scenarios

Let's walk through four real-world debugging scenarios where traces save the day.

Scenario 1: "Why is this so slow?"

A user reports your RAG chatbot takes more than five seconds to respond. You check the trace:

Trace: User query "What is your refund policy?"
├─ embed_query: 45ms
├─ vector_search: 120ms
├─ rerank: 4,200ms ← BOTTLENECK
├─ build_context: 15ms
└─ llm_generation: 890ms

The reranker is taking 4.2 seconds! You investigate and discover it's a cross-encoder model running on CPU. You switch to a faster model or move it to GPU, reducing the time to 200ms.

Without tracing, you might have optimized the LLM call (the most obvious suspect) and made no meaningful improvement.

Scenario 2: "Why did it hallucinate?"

Your Q&A bot confidently states a fact that's completely wrong. You pull up the trace:

Span: vector_search
  Attributes:
    top_k: 5
    results_returned: 5
    similarity_scores: [0.42, 0.38, 0.35, 0.33, 0.31]

Span: build_context
  Attributes:
    context_length: 1200
    context_included: "... unrelated content about shipping policies ..."

The similarity scores are low (< 0.5), and the context is about shipping, not the question asked. The retrieval failed, so the LLM had to guess. You realize your embedding model doesn't understand domain-specific terminology and needs fine-tuning.
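
Traces also tell you which guardrail to add. A minimal sketch of a low-similarity fallback, with the 0.5 cutoff borrowed from the trace above and generate_answer standing in for your own generation step:

def answer_with_guard(question, results, generate_answer, min_score=0.5):
    # If retrieval clearly failed, say so instead of forcing the LLM to guess
    if not results or max(r.score for r in results) < min_score:
        return "I couldn't find relevant information to answer that reliably."
    return generate_answer(question, results)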

Scenario 3: "Why did costs spike?"

Your monthly bill jumps from $500 to $3,200. You query your traces for high token usage:

SELECT trace_id, SUM(tokens_total) as total_tokens
FROM spans
WHERE timestamp > '2024-01-01'
GROUP BY trace_id
ORDER BY total_tokens DESC
LIMIT 10

You discover traces with 15,000+ tokens each. Looking at one:

Span: llm_generation
  Attributes:
    prompt_tokens: 12,800
    completion_tokens: 450
    context_included: "... [50 full documents] ..."

Your context assembly is including entire documents instead of relevant chunks. A quick fix to limit context to 3,000 tokens cuts costs by 70%.
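
A sketch of that fix, assuming a tiktoken-style tokenizer and relevance-sorted chunks; the idea is simply to stop adding chunks once the budget is spent:

import tiktoken

def build_context(chunks, model="gpt-4", max_tokens=3000):
    # Add retrieved chunks until the token budget is exhausted
    enc = tiktoken.encoding_for_model(model)
    selected, used = [], 0
    for chunk in chunks:  # chunks assumed sorted by relevance
        n = len(enc.encode(chunk))
        if used + n > max_tokens:
            break
        selected.append(chunk)
        used += n
    return "\n\n".join(selected)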

Scenario 4: "Why did the agent loop infinitely?"

Your agent sometimes runs for minutes and times out. You trace a failing execution:

Trace: User query "Calculate the ROI of this investment"
├─ Step 1: llm_planning → tool=Calculator
├─ Step 2: tool_execution → error: "division by zero"
├─ Step 3: llm_reasoning → tool=Calculator (same input!)
├─ Step 4: tool_execution → error: "division by zero"
├─ Step 5: llm_reasoning → tool=Calculator (same input!)
...
├─ Step 47: llm_reasoning → tool=Calculator (same input!)
└─ TIMEOUT

The agent can't handle the tool error and keeps retrying with the same input. You add error handling to the agent prompt: "If a tool returns an error, try a different approach or ask the user for clarification."
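
Prompt changes help, but a hard guard in the agent loop is cheap insurance. A sketch that caps iterations and aborts on repeated tool calls; plan_next_step and execute_tool are stand-ins for your agent's own steps:

def run_agent_loop(plan_next_step, execute_tool, max_steps=10):
    seen_calls = set()
    step = plan_next_step(None)
    for _ in range(max_steps):
        if step.is_complete:
            return step.final_answer
        call_signature = (step.tool_name, str(step.tool_input))
        if call_signature in seen_calls:
            raise RuntimeError(f"Agent repeated tool call {call_signature}; aborting")
        seen_calls.add(call_signature)
        result = execute_tool(step.tool_name, step.tool_input)
        step = plan_next_step(result)
    raise TimeoutError(f"Agent exceeded {max_steps} steps without finishing")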

Implementing Tracing: Three Approaches

Now that you understand what tracing captures and why it matters, how do you actually implement it?

Approach 1: Manual Instrumentation

You write the tracing code yourself, as shown in the examples above.

Pros:

  • Full control over what's captured
  • No external dependencies
  • Works with any stack

Cons:

  • Tedious and error-prone
  • Easy to forget to trace something
  • No visualization tools included

When to use: Learning, simple applications, or when you need custom attributes that no library supports.

Approach 2: Framework Auto-Instrumentation

LangChain, LlamaIndex, and other frameworks provide built-in tracing:

# LangChain example (import path and handler name vary by version;
# LangChainTracer is the LangSmith-backed callback tracer)
from langchain.callbacks.tracers import LangChainTracer

tracer = LangChainTracer(
    project_name="my-agent",
    tags=["production"]
)

agent = initialize_agent(
    tools=tools,
    llm=llm,
    callbacks=[tracer]
)

# Now every step is automatically traced
response = agent.run("What's the weather in Paris?")

Pros:

  • Easy setup (a few lines)
  • Comprehensive coverage of framework operations
  • Community support

Cons:

  • Framework lock-in
  • May capture too much or too little
  • Limited customization

When to use: You're already using one of these frameworks and want quick results.

Approach 3: Observability Platform SDK

Open standards like OpenTelemetry and platforms such as LangSmith or specialized LLM observability tools provide SDKs:

# Illustrative SDK: real platforms expose similar init/decorator APIs
from llm_observability import trace, init

# One-time initialization
init(api_key="your-key", project="my-app")

# Decorator-based tracing
@trace(name="rag_query")
def query_knowledge_base(question: str) -> str:
    # Your code here - automatically traced
    embedding = embed(question)
    results = search(embedding)
    return generate_response(results)

# Works with any code, any framework
response = query_knowledge_base("What is your refund policy?")
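
If you want the vendor-neutral version of the same pattern, OpenTelemetry's Python SDK works similarly; a minimal sketch that prints spans to the console (in production you would swap in an OTLP exporter pointed at your backend, and embed/search/generate_response are the same placeholder helpers as above):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# One-time setup: register a tracer provider and an exporter
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("my-app")

def query_knowledge_base(question: str) -> str:
    with tracer.start_as_current_span("rag_query") as span:
        span.set_attribute("question_length", len(question))
        embedding = embed(question)
        results = search(embedding)
        return generate_response(results)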

Pros:

  • Works with any code or framework
  • Includes UI for visualization and analysis
  • Production-ready (sampling, retention, alerting)
  • Minimal code changes

Cons:

  • Vendor dependency
  • May send data outside your infrastructure
  • Cost for high-volume applications

When to use: Production applications where you need reliability, team collaboration, and long-term retention.

Tracing Best Practices

As you implement LLM tracing, follow these guidelines to maximize value and minimize overhead.

Essential Trace Attributes Checklist

| Category | What to Capture | Why It Matters |
|----------|-----------------|----------------|
| Identification | Request ID, timestamp, user/session ID | Track related requests, reproduce issues |
| Model Config | Model name, temperature, max_tokens, stop sequences | Understand behavior variations |
| Inputs | Prompt text (or hash if sensitive) | Debug hallucinations, verify context |
| Outputs | Response text (or hash if sensitive) | Validate quality, catch regressions |
| Resources | Token counts (input, output, total) | Cost attribution, optimization |
| Performance | Latency (wall-clock time), span durations | Identify bottlenecks |
| Status | Success, error, timeout, error messages | Reliability monitoring |

What NOT to Capture

Privacy and Performance Warnings

- Full documents or proprietary data (use summaries or content hashes)

- PII (personally identifiable information) unless you have explicit consent

- API keys or credentials in attributes

- Massive context windows verbatim (truncate or sample to first/last N tokens)
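
A small helper can enforce most of these rules at the single point where attributes get recorded; a sketch, with the hashing scheme and truncation length as assumptions:

import hashlib

def safe_attribute(value: str, sensitive: bool = False, max_chars: int = 2000) -> str:
    # Hash sensitive values so traces stay useful for matching without exposing content
    if sensitive:
        return "sha256:" + hashlib.sha256(value.encode()).hexdigest()
    # Truncate large payloads; keep the head and tail for context
    if len(value) > max_chars:
        half = max_chars // 2
        return value[:half] + " ...[truncated]... " + value[-half:]
    return value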

Sampling strategies: When you're processing millions of requests, storing every trace becomes expensive. Implement sampling:

import random

def should_trace(request) -> bool:
    # Always trace errors
    if request.has_error:
        return True

    # Always trace slow requests
    if request.latency > 5000:  # 5 seconds
        return True

    # Sample 10% of normal requests
    return random.random() < 0.10

Retention policies: Define how long to keep traces. Common approach:

  • Errors: 90 days
  • Slow requests (p99): 30 days
  • Normal requests: 7 days
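
If you run your own storage, these tiers amount to assigning each trace a time-to-live when it's written; a sketch using the fields from the earlier examples:

def retention_days(trace: dict) -> int:
    # Keep errors longest, slow requests a while, everything else briefly
    if trace.get("status") == "error":
        return 90
    if trace.get("latency_ms", 0) > 5000:  # rough stand-in for a p99 cutoff
        return 30
    return 7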

Hands-On: Add Tracing to a Simple App

Let's take a simple Q&A bot and add comprehensive tracing in about 20 lines of code.

Before (blind debugging):

def answer_question(question: str) -> str:
    print(f"Received question: {question}")

    # Retrieve context
    context = knowledge_base.search(question)
    print(f"Found {len(context)} results")

    # Generate answer
    prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

    answer = response.choices[0].message.content
    print(f"Generated answer: {answer[:100]}...")

    return answer

When something goes wrong, you squint at print statements and guess.

After (full visibility):

from llm_observability import trace, span

@trace(name="answer_question")
def answer_question(question: str) -> str:
    trace.set_attribute("question_length", len(question))

    # Retrieve context
    with span("knowledge_base_search") as search_span:
        context = knowledge_base.search(question)
        search_span.set_attribute("results_count", len(context))
        search_span.set_attribute("avg_similarity",
                                  sum(r.score for r in context) / max(len(context), 1))

    # Generate answer
    with span("llm_generation") as llm_span:
        prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
        llm_span.set_attribute("prompt_length", len(prompt))

        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )

        llm_span.set_attribute("model", "gpt-4")
        llm_span.set_attribute("prompt_tokens", response.usage.prompt_tokens)
        llm_span.set_attribute("completion_tokens", response.usage.completion_tokens)

    answer = response.choices[0].message.content
    trace.set_attribute("answer_length", len(answer))

    return answer

Now every execution produces a structured trace you can query, filter, and visualize. You can answer questions like:

  • "Show me all requests where similarity was below 0.5"
  • "What's the p95 latency for the llm_generation span?"
  • "Which questions produce the longest prompts?"

Getting Started with LLM Tracing

Tracing transforms LLM debugging from guesswork to science. Start small and expand coverage as you see value:

Your First Week Implementation Plan

Day 1-2: Pick one critical path
  └─ Identify your main query handler or most-used agent workflow

Day 3-4: Add manual instrumentation
  └─ Capture: prompts, responses, tokens, latency, model config

Day 5: Visualize traces
  └─ Log to structured format (JSON) and view in trace viewer

Day 6-7: Expand coverage
  └─ Add tracing to top 3 workflows based on traffic/importance

Week 2+: Production-grade setup
  └─ Consider observability platform for team collaboration

Three Implementation Approaches Compared

| Approach | Setup Time | Flexibility | Best For |
|----------|------------|-------------|----------|
| Manual Instrumentation | 2-4 hours | High | Learning, simple apps, custom attributes |
| Framework Auto-Instrumentation | 30 minutes | Medium | LangChain/LlamaIndex users, quick wins |
| Observability Platform | 1 hour | High | Production apps, team collaboration, long-term retention |

Conclusion

The difference between debugging with and without LLM tracing is night and day. With traces, you can confidently answer "what happened?" instead of guessing. You'll find performance bottlenecks, catch quality regressions, and optimize costs with actual data.

Your future self, debugging a production incident at 2 AM, will thank you for implementing tracing today.


Ready to see your traces in action? Start with our 5-minute quick start guide to add comprehensive tracing to your LLM application with just a few lines of code.