Sampled Prompt Trace Eval

also known as Sampled Monitoring Eval, Random-Sample LLM-Judge

Capture full prompt/response/metadata traces from production into a monitoring dataset, but only run LLM-judge evaluation on a random sample so monitoring cost stays bounded as traffic grows.

Context

A production LLM application receives thousands or millions of requests. The team wants production quality metrics: LLM-judge scores on actual traffic, not just on offline eval sets. Running an LLM judge on every request adds at least one extra model call per request, which can roughly double inference cost and becomes infeasible at scale.

Problem

Two failure shapes are common. If the judge runs on every trace, monitoring cost matches or exceeds production cost, and judging is soon cut under cost pressure. If no judging runs, the team relies on offline evals that drift from the production distribution, and regressions in real traffic stay invisible until users complain. Without a sampling discipline, monitoring is either unaffordable or absent.

Forces

LLM-judge cost is per-trace; total scales with traffic.
A representative sample is sufficient to track quality drift over time.
Sampling rate must be tuned to traffic volume and budget.
Some slices of traffic (high-value, high-risk) deserve higher sampling than uniform.
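
The trade-off between budget and sample size in the forces above can be sketched with some quick arithmetic. This is a minimal illustration, not part of the pattern itself: the function names, the budget of 50 USD per day, and the cost of 0.002 USD per judgement are hypothetical, and the margin of error uses the standard normal approximation for a pass rate estimated from a simple random sample.

```python
import math


def affordable_sample_rate(daily_budget_usd: float,
                           cost_per_judgement_usd: float,
                           daily_requests: int) -> float:
    """Largest uniform sampling rate the judge budget can pay for, capped at 1.0."""
    if daily_requests <= 0 or cost_per_judgement_usd <= 0:
        raise ValueError("daily_requests and cost_per_judgement_usd must be positive")
    affordable_judgements = daily_budget_usd / cost_per_judgement_usd
    return min(1.0, affordable_judgements / daily_requests)


def margin_of_error(sample_size: int, pass_rate: float, z: float = 1.96) -> float:
    """Approximate 95% half-width of a pass-rate estimate (normal approximation)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / sample_size)


# Hypothetical figures: 500k requests/day, 50 USD/day judge budget,
# 0.002 USD per judge call, and a judge pass rate near 90%.
rate = affordable_sample_rate(50.0, 0.002, 500_000)
judged = round(rate * 500_000)
print(f"rate={rate:.3f} judged={judged} margin=+/-{margin_of_error(judged, 0.9):.4f}")
```

Under these assumed numbers the budget buys a sample of a few percent of traffic, and the daily pass rate is still estimated to well under one percentage point. Small slices need a higher rate to reach a comparable margin, which is one reason to weight them.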

Example

A SaaS platform processes 500k LLM requests per day. The team logs every trace to Opik. An LLM judge scores traces against a faithfulness/answer-quality rubric on a 5% uniform sample plus 50% of enterprise-tier requests. Daily aggregate scores feed a drift dashboard. A regression in faithfulness on the enterprise slice is caught within hours, even though the uniform sample covers only ~25k requests per day and the judge never sees most traffic.
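
To see how the slice weighting changes the judged volume in this example, here is a small sketch. The source does not say what share of traffic is enterprise-tier, so the shares below are purely illustrative assumptions. The function name is hypothetical, and it assumes an enterprise trace is judged at the higher of the two rates rather than being sampled twice.

```python
def expected_judged(daily_requests: int,
                    uniform_rate: float,
                    slice_share: float,
                    slice_rate: float) -> float:
    """Expected number of judged traces per day.

    Assumption: traces in the weighted slice are sampled at
    max(uniform_rate, slice_rate); all other traces at uniform_rate.
    """
    effective_slice_rate = max(uniform_rate, slice_rate)
    in_slice = daily_requests * slice_share
    outside_slice = daily_requests - in_slice
    return outside_slice * uniform_rate + in_slice * effective_slice_rate


# Enterprise share of traffic is NOT given in the text; these values are assumed.
for enterprise_share in (0.0, 0.05, 0.10):
    judged = expected_judged(500_000, 0.05, enterprise_share, 0.50)
    print(f"enterprise share {enterprise_share:.0%}: ~{round(judged):,} judged traces/day")
```

The uniform component alone accounts for about 25k judged traces per day. Any enterprise share adds to that, so the weighted-slice rate is a budget decision in its own right.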

Diagram

flowchart LR
    Req[Production request] --> Inf[Inference]
    Inf --> Log[Trace log: every request]
    Log --> Samp[Sample: 5% uniform + slices]
    Samp --> Judge[LLM judge]
    Judge --> Agg[Aggregate dashboard]
    Agg --> Op[Operator]

Solution

Therefore:

Log every production request's prompt, response, retrieved context, model parameters, and metadata to a monitoring store (Opik, LangSmith, Comet). At a configurable sample rate (e.g. 5% uniform plus 50% on enterprise tenants), run the LLM judge against the rubric. Aggregate scores over time windows and surface drift in dashboards. Sampling rate, weighted slices, and budget are all configuration. This pattern is distinct from shadow-canary (which compares two variants) and from offline eval (which uses a frozen set).
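
The sampling decision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the API of Opik, LangSmith, or any other monitoring store: `SAMPLE_RATES`, `sample_position`, and `should_judge` are hypothetical names, and the rates mirror the example figures. Instead of calling a random number generator, the sketch hashes the trace id, so the same trace always gets the same decision and the sample can be reproduced later.

```python
import hashlib

# Hypothetical configuration; in practice this would live in config, not code.
SAMPLE_RATES = {"default": 0.05, "enterprise": 0.50}


def sample_position(trace_id: str) -> float:
    """Map a trace id to a stable pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Keep the top 53 bits so the result is an exact float strictly below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def should_judge(trace_id: str, tier: str, rates: dict = SAMPLE_RATES) -> bool:
    """Decide whether the LLM judge should score this trace.

    Every trace is logged regardless; this only gates the judge call.
    Unknown tiers fall back to the "default" rate.
    """
    rate = rates.get(tier, rates["default"])
    return sample_position(trace_id) < rate


# Usage: estimate the realised sampling fraction on synthetic trace ids.
ids = [f"trace-{i}" for i in range(20_000)]
default_fraction = sum(should_judge(t, "free") for t in ids) / len(ids)
enterprise_fraction = sum(should_judge(t, "enterprise") for t in ids) / len(ids)
print(f"default ~{default_fraction:.3f}, enterprise ~{enterprise_fraction:.3f}")
```

Because the decision depends only on the trace id and the configured rate, raising a rate keeps every previously sampled trace in the sample, which keeps time series comparable across configuration changes.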

What this pattern forbids. Production quality monitoring with LLM judges must not run on every trace at scale; the judge runs on a random sample drawn at a documented rate.

The smaller patterns that complete this one —

uses LLM-as-Judge ★★: Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
uses Decision Log ★★: Persist the agent's reasoning trace alongside its actions so post-hoc review can explain why.

And the patterns that stand alongside it, or against it —

complements Agent-as-a-Judge ★: Evaluate an agent's full trajectory (steps, tool calls, intermediate states) by another agent rather than scoring only the final output.
complements Eval Harness ★★: Run a held-out dataset against agent versions to detect regressions and measure improvement.
complements [evaluation-driven-development]
complements Shadow Canary ★★: Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.

Used in recipes

Production LLM Platform
hardening
Bounded-cost production-quality monitoring via random + slice-weighted sampling.

Used in frameworks

Langfuse
first-class · 5 patterns · Enterprise Platforms ★★ · mature
Langfuse ingests full production traces and lets you attach LLM-as-a-judge evaluators that run on a configurable sampling percentage of traces so judge cost stays bounded as traff…

Provenance

Source: patterns/sampled-prompt-trace-eval.md on GitHub · commit 135ae3c
Added to catalog: 2026-05-23
Last updated: 2026-05-23
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.