
Observability

Prereq: Cost & Latency. Observability is what tells you whether your cost optimizations also broke something.

When you wrap a request handler in with tracer.start_as_current_span("answer_user_query") as span: span.set_attribute("user_id", uid), the next thing the framework hands you is a tree of nested spans — one for the retrieval call, one for the small-model planner, one for the big-model fallback, one for each tool invocation — each with start/end times, inputs, outputs, and arbitrary metadata. That tree is a trace. Stream it to Langfuse or Phoenix and the same tree becomes a row in a database, a flame chart in the UI, a per-user cost dashboard, and the input to an async judge that scores it on faithfulness 30 seconds later. Stdout-printing your prompts isn’t observability; spans, token counts, and per-trace eval scores are.

The single most common production-LLM failure mode in 2024–2026 is silent quality regression — a routing change, an FP8 KV cache rollout, or a prompt update lands; cost drops 40%; the offline eval doesn’t move; and three weeks later support tickets reveal the model started giving wrong answers on a specific class of queries. Without traces, online evals on live traffic, and a regression alarm, you wouldn’t have caught it. The previous lesson on cost levers is unsafe to ship without this one — every lever it teaches is a regression risk, and observability is the harness that makes them survivable.

Observability is also what unlocks every “we should iterate on the prompt” conversation. You need to be able to grep across last week’s traffic, find the failed conversations, replay them, and verify your fix in seconds. A team that can do this iterates 10× faster than one that can’t.

TL;DR

  • LLM systems need three layers of observation that map cleanly to web-app analogues: traces (every input, every tool call, every token, every output), online evals (a judge that scores live traffic and produces a quality time-series), and regression detection (alerts when a deploy moved a metric).
  • Langfuse, Phoenix, and LangSmith are the 2026 production reference stacks. Langfuse and Phoenix are OSS; LangSmith is paid but tightest with LangChain. They all model the same OpenTelemetry-style trace tree: trace → span → tool/llm call → tokens.
  • Online evals are not unit tests. Unit tests run on a fixed dataset; online evals score real user traffic, surface drift, and feed regressions into your CI loop.
  • Five metrics to track from day one: latency (TTFT + TPOT), cost ($/request), trace volume, fail rate (parse errors, refusals), and a single quality score from your judge model. Everything else is derivative.
  • Cost-down without observability = quality-down silently. Every lever from the previous lesson is a regression risk; you cannot ship them safely without the harness on.

Mental model

The trace tree is the substrate. Everything else — dashboards, evals, alerts, replay — is a query against it.

Layer 1 — Tracing

A trace is a tree of spans. The root is one user request. Children are model calls, tool calls, retrieval calls, sub-LLM calls. Every span has: a name, a start/end time, an input, an output, a status (ok/error), and arbitrary metadata (model name, sampling params, tokens, cost).

The wire format is OpenTelemetry-compatible. Every modern stack (Langfuse, Phoenix, LangSmith) accepts OTel; many production teams emit to multiple sinks at once.
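
A minimal sketch of that multi-sink setup with the stock OpenTelemetry Python SDK (the opentelemetry-sdk and OTLP HTTP exporter packages); the endpoints are placeholders, not real collector URLs:

# Emit the same span tree to two OTLP sinks at once. Endpoints are placeholders;
# point them at whichever OTLP-compatible backends you actually run.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="https://tracing-backend.example.com/v1/traces")))
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces")))  # e.g. a self-hosted sink
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("llm-app")
with tracer.start_as_current_span("answer_user_query") as span:
    span.set_attribute("user_id", "u-123")  # attributes ride along to every sink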

# Langfuse-style instrumentation. The decorator captures inputs/outputs/timing.
from langfuse import observe

@observe(name="answer_user_query")
def answer(query: str) -> str:
    docs = retrieve(query)                # auto-traced as a child span
    plan = small_model.plan(query, docs)  # ditto
    if confidence(plan) > 0.7:
        return small_model.answer(plan)
    return big_model.answer(plan)         # different span, different model

# All four function calls show up as nested spans in one trace.

In production what matters is what you put in metadata. Recommended fields:

Field | Why
user_id | Per-user quality and cost dashboards.
session_id | Multi-turn replay.
prompt_version | Which template was active.
model | Which model answered (small or big in routed setups).
tokens_in/out | Cost reconciliation.
prefix_hit | Did APC hit? Per-trace view of prefix caching.
eval_score | Filled in async by the judge.

Skip user_message if it’s PII-sensitive; redact server-side before emitting.
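
A sketch of attaching those fields as span attributes, reusing the tracer from the OTel snippet above; the redact_pii regex and all literal values are illustrative, not a real schema:

# Attach the recommended metadata as span attributes; redact before anything is emitted.
import re

def redact_pii(text: str) -> str:
    # Illustrative only: mask email-like strings so raw PII never reaches the trace sink.
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<email>", text)

user_message = "my email is jane@example.com, why was I charged twice?"
with tracer.start_as_current_span("answer_user_query") as span:
    span.set_attribute("user_id", "u-123")
    span.set_attribute("session_id", "s-456")
    span.set_attribute("prompt_version", "answer-v14")
    span.set_attribute("model", "small-planner")
    span.set_attribute("tokens_in", 512)
    span.set_attribute("tokens_out", 87)
    span.set_attribute("prefix_hit", True)
    span.set_attribute("user_message", redact_pii(user_message))  # or omit it entirely
# eval_score is not set here: the async judge fills it in later.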

Layer 2 — Online evals

A judge — usually a small LLM, sometimes a finetuned classifier — scores live traces in the background. Score schemas you’ll actually use:

Score | What it measures | How
Faithfulness | Does the answer match the retrieved docs? | Judge prompt + RAG context
Helpfulness | Does it answer the user’s actual question? | Judge prompt
Refusal correctness | Did we refuse correctly (not over-refuse)? | Judge prompt with task definition
Format compliance | Did it emit valid JSON / call the right tool? | Programmatic check (no LLM needed)
Latency | p99, TTFT/TPOT distributions | From the trace
User signal (sparse) | 👍/👎 from the UI | Direct user feedback

The killer move: your evals run on every production trace, asynchronously, and produce a time-series. You don’t run them only at deploy time on a frozen dataset — you run them continuously on the live traffic distribution.

# Async judge — runs in a worker, scores last 5 minutes of traces.
def judge_recent(window_min=5):
    for trace in langfuse.get_traces(since=now() - window_min * 60):
        if trace.metadata.get("eval_score") is not None:
            continue
        score = judge_model.complete(
            "Score 1-5: Did the answer faithfully address the question?",
            context=trace.input,
            answer=trace.output,
        )
        langfuse.update(trace.id, eval_score=parse_score(score))

Score histograms over time are what you watch. A 0.2-point drop in faithfulness over a 24-hour window is your alert.

Layer 3 — Regression detection

Three forms, in increasing rigor:

  1. Threshold alerts. “Faithfulness below 4.2 for over 30 min → page.” Cheap, noisy, often disabled within a week.
  2. Time-series anomaly detection. Compare the current 1h window to the last 7-day distribution. Slack alert on greater-than-2σ drift. Better signal-to-noise; a minimal version is sketched after this list.
  3. A/B with shadow traffic. New deploy gets 5% of traffic; eval scores compared to control’s 95%. Statistically meaningful regressions get auto-rolled-back.
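
A minimal sketch of form (2); the week of hourly faithfulness means is synthetic and the Slack alert is a print stand-in:

# Form (2): compare the current 1h window against the last 7 days of hourly means.
from statistics import mean, stdev

def alert(msg: str):
    print("SLACK:", msg)  # stand-in for a real webhook call

def check_drift(hourly_means, current_hour, sigmas=2.0):
    # hourly_means: faithfulness means for the last 7*24 hourly windows
    mu, sd = mean(hourly_means), stdev(hourly_means)
    if current_hour < mu - sigmas * sd:
        alert(f"faithfulness drifted: {current_hour:.2f} vs baseline {mu:.2f} ± {sd:.2f}")

# Synthetic week of scores hovering around 4.5; a current hour at 4.1 fires the alert.
check_drift([4.5 + 0.05 * (i % 3) for i in range(7 * 24)], current_hour=4.1)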

Most teams do (1) for the first 6 months, then upgrade to (2). (3) is the “we know what we’re doing” final state — typically required before any team feels safe shipping daily.

What goes on the dashboard

Five panels, in order of importance:

  1. TTFT and TPOT distributions (P50, P99) over time, broken out by model.
  2. Cost per 1K requests rolling 24h.
  3. Quality score (from the judge) — line chart over time, with deploy markers.
  4. Failure rate — refusals, parse errors, judge-marked low-quality.
  5. Top-N slow traces of the last hour, with deep-link to the trace tree.

Anything else is optional. These five are what stop a 3 AM page from becoming a 3-day incident.
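
A toy version of panels 2 and 4, computed straight from trace records whose fields mirror the metadata table above (the records themselves are made up):

# Panels 2 and 4 from raw trace records: cost per 1K requests and failure rate.
traces = [
    {"cost_usd": 0.0004, "status": "ok"},
    {"cost_usd": 0.0006, "status": "parse_error"},
    {"cost_usd": 0.0110, "status": "ok"},
    {"cost_usd": 0.0005, "status": "refusal"},
]

cost_per_1k = 1000 * sum(t["cost_usd"] for t in traces) / len(traces)
fail_rate = sum(t["status"] != "ok" for t in traces) / len(traces)
print(f"cost per 1K requests: ${cost_per_1k:.2f}")  # rolling 24h in the real dashboard
print(f"failure rate: {fail_rate:.0%}")             # refusals + parse errors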

Replay and prompt iteration

The single largest productivity multiplier on an LLM team is “I can grab any trace, edit the prompt or the routing decision, and re-run it instantly.” Both Langfuse and Phoenix ship a “Playground” / “Prompts” view that does exactly this.

The discipline:

  1. Find a failed trace via your dashboard or grep.
  2. Open it; see inputs, intermediates, outputs.
  3. Hit “Replay” with a candidate prompt change.
  4. If output is better, save the new prompt version. The trace is automatically tagged with prompt_version.
  5. Roll out to canary; watch the eval score for that prompt version.

This is the loop. Teams that don’t have this loop spend their iteration budget on theory; teams that do iterate on data.
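
The same loop as code; every helper here is a stub standing in for your trace store, model call, and judge:

# Replay a failed trace against a candidate prompt; stubs stand in for the real stack.
def fetch_trace(trace_id):            # stub: pull one trace from your store
    return {"input": "why was I charged twice?", "output": "Please contact support."}

def call_model(prompt, user_input):   # stub: re-run the request with the new prompt
    return f"[{prompt}] Checking your last two invoices..."

def judge(user_input, answer):        # stub: your judge, called synchronously here
    return 5 if "invoice" in answer else 2

def replay(trace_id, candidate_prompt):
    trace = fetch_trace(trace_id)
    new_output = call_model(candidate_prompt, trace["input"])
    return judge(trace["input"], trace["output"]), judge(trace["input"], new_output)

old_score, new_score = replay("trace-abc123", "answer-v15-draft")
if new_score > old_score:
    print("promote answer-v15-draft to canary and watch its eval score")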

Run it in your browser — a tiny tracer + judge

A 60-line in-memory tracing stack: spans, child spans, an async judge, and a P99 calculation.
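
A minimal stand-in for that stack in plain Python, with nothing tied to a real SDK:

# A tiny in-memory tracing stack: spans, child spans, a toy judge, and percentiles.
import contextlib
import time

class Span:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.children = name, parent, []
        self.metadata = {}
        self.start = time.perf_counter()
        self.end = None

    def duration_ms(self):
        return (self.end - self.start) * 1000

class Tracer:
    def __init__(self):
        self.roots, self._stack = [], []

    @contextlib.contextmanager
    def span(self, name, **metadata):
        s = Span(name, parent=self._stack[-1] if self._stack else None)
        s.metadata.update(metadata)
        (s.parent.children if s.parent else self.roots).append(s)
        self._stack.append(s)
        try:
            yield s
        finally:
            s.end = time.perf_counter()
            self._stack.pop()

def pctl(values, p):
    s = sorted(values)
    return s[min(int(p / 100 * len(s)), len(s) - 1)]

def judge(span):
    # Toy judge: a real system calls a small LLM here, asynchronously.
    return 5 if "refund" in span.metadata.get("output", "") else 3

tracer = Tracer()
for i in range(20):
    with tracer.span("answer_user_query", user_id=f"u{i}"):
        with tracer.span("retrieve"):
            time.sleep(0.001)
        with tracer.span("small_model.answer",
                         output="refund issued" if i % 4 else "sorry, escalating"):
            time.sleep(0.002)

durations = [r.duration_ms() for r in tracer.roots]
leaves = [c for r in tracer.roots for c in r.children if not c.children]
print("root latency ms  P50:", round(pctl(durations, 50), 2), " P99:", round(pctl(durations, 99), 2))
print("mean judge score:", round(sum(judge(s) for s in leaves) / len(leaves), 2))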

The shape is what matters: a tracer captures spans; an aggregator turns them into P50/P99; a judge scores the leaves. Production systems are this with persistence, sampling, and a UI on top.

Quick check

Fill in the blank
The single highest-leverage observability practice — score *what* asynchronously?
Not test-set evals. Not deploy-time evals. The traffic that's actually happening right now.
Quick check
A team rolled out FP8 KV cache and saw cost drop 40%. They're celebrating. What's their highest-leverage *next* action?

Key takeaways

  1. Three layers: traces, online evals, regression alerts. Each has an OSS reference (Langfuse / Phoenix) — pick one and instrument from day one.
  2. Online evals run on live traffic, not test sets. That’s the loop that catches drift.
  3. Five-panel dashboard: TTFT/TPOT, $/1K requests, quality score, fail rate, top-N slow traces. Everything else is optional.
  4. Replay loop is the productivity multiplier. “Find a failed trace → tweak prompt → re-run” is the daily-driver workflow of any team that ships fast.
  5. Cost-down without quality-up is a trap. Every lever from the previous lesson is a regression risk; observability is what makes them safe.

Go deeper
