Observability
Prereq: Cost & Latency. Observability is what tells you whether your cost optimizations also broke something.
When you wrap a request handler in `with tracer.start_as_current_span("answer_user_query") as span: span.set_attribute("user_id", uid)`, the next thing the framework hands you is a tree of nested spans — one for the retrieval call, one for the small-model planner, one for the big-model fallback, one for each tool invocation — each with start/end times, inputs, outputs, and arbitrary metadata. That tree is a trace. Stream it to Langfuse or Phoenix and the same tree becomes a row in a database, a flame chart in the UI, a per-user cost dashboard, and the input to an async judge that scores it on faithfulness 30 seconds later. Stdout-printing your prompts isn’t observability; spans, token counts, and per-trace eval scores are.
The single most common production-LLM failure in 2024–2026 is silent quality regression — a routing change, an FP8 KV cache rollout, or a prompt update lands; cost drops 40%; the offline eval doesn’t move; and three weeks later support tickets reveal the model started giving wrong answers on a specific class of queries. Without traces, online evals on live traffic, and a regression alarm, you wouldn’t have caught it. The previous lesson on cost levers is unsafe to ship without this one — every lever it teaches is a regression risk, and observability is the harness that makes them survivable. It is also what unlocks every “we should iterate on the prompt” conversation: grep across last week’s traffic, find the failed conversations, replay them, and verify your fix in seconds. A team that can do this iterates roughly 10× faster than one that can’t.
TL;DR
- LLM systems need three layers of observation that map cleanly to web-app analogues: traces (every input, every tool call, every token, every output), online evals (a judge that scores live traffic and produces a quality time-series), and regression detection (alerts when a deploy moved a metric).
- Langfuse, Phoenix, and LangSmith are the 2026 production reference stacks. Langfuse and Phoenix are OSS; LangSmith is paid but tightest with LangChain. They all model the same OpenTelemetry-style trace tree: trace → span → tool/llm call → tokens.
- Online evals are not unit tests. Unit tests run on a fixed dataset; online evals score real user traffic, surface drift, and feed regressions into your CI loop.
- Five metrics to track from day one: latency (TTFT + TPOT), cost ($/request), trace volume, fail rate (parse errors, refusals), and a single quality score from your judge model. Everything else is derivative.
- Cost-down without observability = quality-down silently. Every lever from the previous lesson is a regression risk; you cannot ship them safely without the harness on.
Mental model
The trace tree is the substrate. Everything else — dashboards, evals, alerts, replay — is a query against it.
Layer 1 — Tracing
A trace is a tree of spans. The root is one user request. Children are model calls, tool calls, retrieval calls, sub-LLM calls. Every span has: a name, a start/end time, an input, an output, a status (ok/error), and arbitrary metadata (model name, sampling params, tokens, cost).
The wire format is OpenTelemetry-compatible. Every modern stack (Langfuse, Phoenix, LangSmith) accepts OTel; many production teams emit to multiple sinks at once.
```python
# Langfuse-style instrumentation. The decorator captures inputs/outputs/timing.
from langfuse import observe

@observe(name="answer_user_query")
def answer(query: str) -> str:
    docs = retrieve(query)                # auto-traced as a child span
    plan = small_model.plan(query, docs)  # ditto
    if confidence(plan) > 0.7:
        return small_model.answer(plan)
    return big_model.answer(plan)         # different span, different model

# All four function calls show up as nested spans in one trace.
```

In production, what matters is what you put in metadata. Recommended fields:
| Field | Why |
|---|---|
| `user_id` | Per-user quality and cost dashboards. |
| `session_id` | Multi-turn replay. |
| `prompt_version` | Which template was active. |
| `model` | Which model answered (small or big in routed setups). |
| `tokens_in/out` | Cost reconciliation. |
| `prefix_hit` | Did APC hit? Per-trace view of prefix caching. |
| `eval_score` | Filled in async by the judge. |
Skip `user_message` if it’s PII-sensitive; redact server-side before emitting.
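A minimal server-side redaction pass might look like the sketch below. The regexes are illustrative and deliberately crude; a real pipeline would use a dedicated PII scrubber:

```python
import re

# Illustrative patterns only -- real PII detection needs a proper scrubber.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Mask emails and phone-like digit runs before a span leaves the server."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))

def scrub_metadata(metadata: dict) -> dict:
    """Redact string-valued fields; leave counters and ids untouched."""
    return {k: redact(v) if isinstance(v, str) else v for k, v in metadata.items()}
```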
Layer 2 — Online evals
A judge — usually a small LLM, sometimes a finetuned classifier — scores live traces in the background. Score schemas you’ll actually use:
| Score | What it measures | How |
|---|---|---|
| Faithfulness | Does the answer match the retrieved docs? | Judge prompt + RAG context |
| Helpfulness | Does it answer the user’s actual question? | Judge prompt |
| Refusal correctness | Did we refuse correctly (not over-refuse)? | Judge prompt with task definition |
| Format compliance | Did it emit valid JSON / call the right tool? | Programmatic check (no LLM needed) |
| Latency p99 | TTFT and TPOT distributions | From the trace |
| User signal (sparse) | 👍/👎 from the UI | Direct user feedback |
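The format-compliance row is the cheapest to implement. A sketch of the programmatic check, assuming a hypothetical tool allowlist:

```python
import json

ALLOWED_TOOLS = {"search", "calculator", "retrieve_docs"}  # illustrative allowlist

def format_score(raw_output: str) -> float:
    """1.0 if the model emitted a JSON object naming an allowed tool, else 0.0."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(payload, dict):
        return 0.0
    return 1.0 if payload.get("tool") in ALLOWED_TOOLS else 0.0
```

Because it is pure code, this check can run inline on every trace at zero marginal cost, unlike the judge-prompt scores.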
The killer move: your evals run on every production trace, asynchronously, and produce a time-series. You don’t run them only at deploy time on a frozen dataset — you run them continuously on the live traffic distribution.
```python
# Async judge — runs in a worker, scores the last 5 minutes of traces.
def judge_recent(window_min=5):
    for trace in langfuse.get_traces(since=now() - window_min * 60):
        if trace.metadata.get("eval_score") is not None:
            continue  # already scored
        score = judge_model.complete(
            "Score 1-5: Did the answer faithfully address the question?",
            context=trace.input, answer=trace.output,
        )
        langfuse.update(trace.id, eval_score=parse_score(score))
```

Score histograms over time are what you watch. A 0.2-point drop in faithfulness over a 24-hour window is your alert.
Layer 3 — Regression detection
Three forms, in increasing rigor:
1. Threshold alerts. “Faithfulness below 4.2 for over 30 min → page.” Cheap, noisy, often disabled within a week.
2. Time-series anomaly detection. Compare the current 1h window to the last 7 days’ distribution. Slack alert on greater-than-2σ drift. Better signal-to-noise.
3. A/B with shadow traffic. The new deploy gets 5% of traffic; its eval scores are compared to the control’s 95%. Statistically meaningful regressions get auto-rolled-back.
Most teams do (1) for the first 6 months, then upgrade to (2). (3) is the “we know what we’re doing” final state — typically required before any team feels safe shipping daily.
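Form (2) reduces to a few lines of statistics. A sketch, with the 2σ threshold and the window sizes as tunable assumptions:

```python
from statistics import mean, stdev

def drift_alert(baseline: list[float], window: list[float], sigmas: float = 2.0) -> bool:
    """True when the current window's mean quality score sits more than
    `sigmas` standard deviations below the trailing (e.g. 7-day) baseline."""
    mu, sd = mean(baseline), stdev(baseline)
    if sd == 0:
        return mean(window) < mu  # degenerate flat baseline: any drop alerts
    return mean(window) < mu - sigmas * sd
```

In practice the baseline is the per-hour mean of judge scores over the last week, and the window is the most recent hour.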
What goes on the dashboard
Five panels, in order of importance:
- TTFT and TPOT distributions (P50, P99) over time, broken out by model.
- Cost per 1K requests rolling 24h.
- Quality score (from the judge) — line chart over time, with deploy markers.
- Failure rate — refusals, parse errors, judge-marked low-quality.
- Top-N slow traces of the last hour, with deep-link to the trace tree.
Anything else is optional. These five are what stops a 3 AM page becoming a 3-day incident.
Replay and prompt iteration
The single largest productivity multiplier on an LLM team is “I can grab any trace, edit the prompt or the routing decision, and re-run it instantly.” Both Langfuse and Phoenix ship a “Playground” / “Prompts” view that does exactly this.
The discipline:
- Find a failed trace via your dashboard or grep.
- Open it; see inputs, intermediates, outputs.
- Hit “Replay” with a candidate prompt change.
- If the output is better, save the new prompt version. The trace is automatically tagged with `prompt_version`.
- Roll out to canary; watch the eval score for that prompt version.
This is the loop. Teams that don’t have this loop spend their iteration budget on theory; teams that do iterate on data.
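The replay step itself is small once traces carry inputs and outputs. A sketch against stubbed `model` and `judge` callables (all names hypothetical):

```python
def replay(trace: dict, new_prompt: str, model, judge) -> dict:
    """Re-run a stored trace's input under a candidate prompt template and
    compare judge scores side by side."""
    old_score = judge(trace["input"], trace["output"])
    new_output = model(new_prompt.format(query=trace["input"]))
    new_score = judge(trace["input"], new_output)
    return {"old": old_score, "new": new_score, "improved": new_score > old_score}
```

The hosted Playground views do the same thing with the real model behind the stub, plus prompt versioning on save.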
Run it in your browser — a tiny tracer + judge
The shape is what matters: a tracer captures spans; an aggregator turns them into P50/P99; a judge scores the leaves. Production systems are this with persistence, sampling, and a UI on top.
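A stdlib-only sketch of that shape, with an in-memory span list standing in for the database and a keyword check standing in for the judge model:

```python
import time
from contextlib import contextmanager

SPANS: list[dict] = []  # in-memory sink; production streams to a real store

@contextmanager
def span(name: str, **metadata):
    """Record one node of the trace tree with start/end times."""
    rec = {"name": name, "start": time.time(), "output": None, **metadata}
    SPANS.append(rec)
    try:
        yield rec
    finally:
        rec["end"] = time.time()

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile; fine for a sketch."""
    s = sorted(values)
    return s[min(int(p / 100 * len(s)), len(s) - 1)]

def latency_stats(name: str) -> dict:
    ms = [(r["end"] - r["start"]) * 1000 for r in SPANS if r["name"] == name]
    return {"p50": percentile(ms, 50), "p99": percentile(ms, 99)}

def toy_judge(rec: dict) -> int:
    """Keyword stand-in for an LLM judge: reward grounded answers."""
    return 5 if rec["output"] and "because" in rec["output"] else 2

# Usage: wrap calls, then query the span log.
with span("answer_user_query", user_id="u1") as root:
    with span("retrieve"):
        time.sleep(0.01)  # pretend to hit the vector store
    root["output"] = "42, because the retrieved doc says so"

scores = [toy_judge(r) for r in SPANS if r["name"] == "answer_user_query"]
```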
Key takeaways
- Three layers: traces, online evals, regression alerts. Each has an OSS reference (Langfuse / Phoenix) — pick one and instrument from day one.
- Online evals run on live traffic, not test sets. That’s the loop that catches drift.
- Five-panel dashboard: TTFT/TPOT, $/1K requests, quality score, fail rate, top-N slow traces. Everything else is optional.
- Replay loop is the productivity multiplier. “Find a failed trace → tweak prompt → re-run” is the daily-driver workflow of any team that ships fast.
- Cost-down without quality-up is a trap. Every lever from the previous lesson is a regression risk; observability is what makes them safe.
Go deeper
- Docs: Langfuse Documentation. OSS LLM observability stack. The data model section explains the trace tree exactly. Self-host or cloud.
- Docs: Arize Phoenix Documentation. OSS, OpenTelemetry-native. Stronger evals story than Langfuse; weaker self-hosting story.
- Docs: LangSmith Documentation. Paid, tightest with LangChain. Best replay/playground UI in 2026; vendor lock-in is real.
- Paper: SWE-bench: Can Language Models Resolve Real-World GitHub Issues? The eval-on-real-traffic philosophy applied to coding. Read for the eval design intuition.
- Blog: Hamel Husain, “Your AI Product Needs Evals”. The clearest practitioner essay on online vs offline evals. Required reading.
- Blog: Eugene Yan, “LLM Patterns”. Six high-leverage patterns including evals; the regression-loop section is gold.
- Repo: langfuse/langfuse. Reference implementation. Read `web/src/server/api/routers/traces.ts` for the data model and `worker/src/queues/` for the async judge.
- Repo: Arize-ai/phoenix. Phoenix internals; particularly the OpenInference span semantics.