Sampled Prompt Trace Eval
also known as Sampled Monitoring Eval, Random-Sample LLM-Judge
Capture full prompt/response/metadata traces from production into a monitoring dataset, but only run LLM-judge evaluation on a random sample so monitoring cost stays bounded as traffic grows.
Context
A production LLM application receives thousands or millions of requests. The team wants production quality metrics: LLM-judge scores on actual traffic, not just on offline eval sets. Running an LLM judge on every request adds at least one extra model call per request, which can roughly double inference cost and is infeasible at scale.
Problem
Two failure shapes are common. Run the judge on every trace and the monitoring cost matches or exceeds the production cost; budget pressure soon gets the judging cut. Run no judging and the team relies on offline evals that drift from the production distribution; regressions in real traffic stay invisible until users complain. Without a sampling discipline, monitoring is either unaffordable or absent.
Forces
- LLM-judge cost is per-trace; total scales with traffic.
- A representative sample is sufficient to track quality drift over time.
- Sampling rate must be tuned to traffic volume and budget.
- Some slices of traffic (high-value, high-risk) deserve higher sampling than uniform.
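The claim that a representative sample is sufficient can be checked with a rough margin-of-error estimate. The sketch below assumes the judge emits a pass/fail verdict per trace and that the sample is a simple random sample; neither assumption comes from the pattern itself, and `margin_of_error` is a hypothetical helper name. It uses the normal approximation for a proportion.

```python
import math


def margin_of_error(sample_size: int, pass_rate: float = 0.5, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a pass rate.

    Assumes pass/fail judge verdicts and simple random sampling.
    pass_rate=0.5 is the worst case (widest interval).
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / sample_size)


# Compare how the interval narrows as the judged sample grows.
for n in (1_000, 5_000, 25_000):
    print(n, round(margin_of_error(n), 4))
```

Under these assumptions, a daily sample of 25,000 judged traces pins the pass rate to within well under one percentage point, which is why judging every trace adds cost without adding much signal.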
Example
A SaaS platform processes 500k LLM requests per day. The team logs every trace to Opik. An LLM judge scores traces against a faithfulness/answer-quality rubric on a 5% uniform sample plus 50% of enterprise-tier requests. Daily aggregate scores feed a drift dashboard. A regression in faithfulness on the enterprise slice is caught within hours, even though the uniform sample covers only ~25k requests per day (plus the enterprise oversample).
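The daily judge volume in this example depends on how much of the traffic is enterprise-tier, which the example does not state. The sketch below treats that share as a hypothetical parameter (`enterprise_share`) and assumes enterprise traces are judged at 50% instead of 5%, not in addition to it.

```python
def expected_judged_per_day(
    daily_requests: int,
    enterprise_share: float,  # hypothetical: fraction of traffic that is enterprise-tier
    uniform_rate: float = 0.05,
    enterprise_rate: float = 0.50,
) -> float:
    """Expected number of traces sent to the judge per day."""
    if not 0.0 <= enterprise_share <= 1.0:
        raise ValueError("enterprise_share must be between 0 and 1")
    enterprise_requests = daily_requests * enterprise_share
    other_requests = daily_requests - enterprise_requests
    return other_requests * uniform_rate + enterprise_requests * enterprise_rate


# With no enterprise traffic the judge sees 5% of 500k requests.
print(expected_judged_per_day(500_000, enterprise_share=0.0))
# A 2% enterprise share (an illustrative value) raises the volume noticeably.
print(expected_judged_per_day(500_000, enterprise_share=0.02))
```

Because the enterprise rate is ten times the uniform rate, even a small enterprise share moves the judge budget, so the share is worth measuring before the rates are fixed.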
Solution
Therefore:
Log every production request's prompt, response, retrieved context, model parameters, and metadata to a monitoring store (Opik, LangSmith, Comet). At a configurable sample rate (e.g. 5% uniform plus 50% on enterprise tenants), run the LLM judge against the rubric. Aggregate scores over time windows and surface drift in dashboards. Sampling rate, weighted slices, and budget are all configuration. This pattern is distinct from Shadow Canary (which compares two variants) and from offline eval (which uses a frozen set).
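The sampling decision described above can be sketched as a small function driven by configuration. Everything here is illustrative: `SamplingConfig`, `sample_fraction`, and `should_judge` are hypothetical names, not part of Opik, LangSmith, or any other monitoring store. The sketch derives a pseudo-random number from a hash of the trace ID rather than calling a random number generator, so the same trace always gets the same decision; that is a design choice of this example, not something the pattern prescribes.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class SamplingConfig:
    """Hypothetical config: a default rate plus per-tier overrides."""
    default_rate: float = 0.05
    tier_rates: dict = field(default_factory=lambda: {"enterprise": 0.50})

    def rate_for(self, tier: str) -> float:
        return self.tier_rates.get(tier, self.default_rate)


def sample_fraction(trace_id: str) -> float:
    """Map a trace ID to a stable pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # 48 bits fit exactly in a float, so the result is always below 1.0.
    return int.from_bytes(digest[:6], "big") / 2**48


def should_judge(trace_id: str, tier: str, config: SamplingConfig) -> bool:
    """Decide whether this trace is sent to the LLM judge."""
    return sample_fraction(trace_id) < config.rate_for(tier)


config = SamplingConfig()
print(should_judge("trace-0001", "enterprise", config))
print(should_judge("trace-0001", "standard", config))
```

Every trace is still logged; only the call to the judge is gated by `should_judge`. Changing the rates means editing the config, not the code path.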
What this pattern forbids. Production quality monitoring with LLM judges must not run on every trace at scale; the judge runs on a random sample drawn at a documented rate.
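One practical reason to document the rate: when slices are sampled at different rates, a plain average of judged scores over-represents the oversampled slice. Given the documented rates, each judged trace can be weighted by the inverse of its sampling rate. The sketch below assumes judge scores are numbers between 0 and 1 and uses hypothetical helper names (`slice_means`, `traffic_weighted_mean`).

```python
from collections import defaultdict
from statistics import fmean


def slice_means(judged):
    """Mean judge score per tier. `judged` is an iterable of (tier, score)."""
    by_tier = defaultdict(list)
    for tier, score in judged:
        by_tier[tier].append(score)
    return {tier: fmean(scores) for tier, scores in by_tier.items()}


def traffic_weighted_mean(judged, tier_rates, default_rate):
    """Inverse-probability-weighted mean over one time window.

    Each judged trace stands in for 1/rate production traces, so a slice
    sampled at 50% does not dominate a slice sampled at 5%.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for tier, score in judged:
        weight = 1.0 / tier_rates.get(tier, default_rate)
        weighted_sum += weight * score
        total_weight += weight
    if total_weight == 0.0:
        raise ValueError("no judged traces in this window")
    return weighted_sum / total_weight


# Illustrative window: 50 enterprise traces scoring 0.5, 5 standard traces scoring 1.0.
window = [("enterprise", 0.5)] * 50 + [("standard", 1.0)] * 5
print(slice_means(window))
print(traffic_weighted_mean(window, {"enterprise": 0.50}, default_rate=0.05))
```

The per-slice means are what catch a regression confined to one tier; the weighted mean is the better estimate of overall production quality.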
The smaller patterns that complete this one —
- uses LLM-as-Judge ★★: Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
- uses Decision Log ★★: Persist the agent's reasoning trace alongside its actions so post-hoc review can explain why.
And the patterns that stand alongside it, or against it —
- complements Agent-as-a-Judge ★: Evaluate an agent's full trajectory (steps, tool calls, intermediate states) by another agent rather than scoring only the final output.
- complements Eval Harness ★★: Run a held-out dataset against agent versions to detect regressions and measure improvement.
- complements [evaluation-driven-development]
- complements Shadow Canary ★★: Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.