Governance & Observability

Sampled Prompt Trace Eval

Capture full prompt/response/metadata traces from production into a monitoring dataset, but run LLM-judge evaluation only on a random sample so that monitoring cost stays bounded as traffic grows.

Problem

Two failure modes are common. If the judge runs on every trace, the monitoring cost matches or exceeds the production cost, and under engineering pressure judging is soon cut. If nothing is judged, the team relies on offline evals that drift from the production distribution, and regressions in real traffic stay invisible until users complain. Without a sampling discipline, monitoring is either unaffordable or absent.

Solution

Log every production request's prompt, response, retrieved context, model parameters, and metadata to a monitoring store (Opik, LangSmith, Comet). At a configurable sample rate (e.g. 5% uniform plus 50% on enterprise tenants), run the LLM judge against the rubric. Aggregate scores over time windows and surface drift in dashboards. Sampling rate, weighted slices, and budget are all configuration. This pattern is distinct from shadow-canary (which compares two variants) and from offline eval (which uses a frozen set).
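
The sampling decision can be sketched in a few lines of Python. This is a minimal illustration, not the API of any monitoring store: the names SamplingConfig, sample_rate_for, and should_judge are hypothetical, and the reading of "5% uniform plus 50% on enterprise tenants" as a base rate with a per-tier override is an assumption. The sketch also swaps a true random draw for a hash of the trace ID, so the same trace always gets the same decision on replay.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SamplingConfig:
    """Hypothetical config: a base rate plus per-slice overrides."""
    base_rate: float = 0.05
    slice_rates: dict = field(default_factory=lambda: {"enterprise": 0.50})


def sample_rate_for(tenant_tier: str, config: SamplingConfig) -> float:
    # A slice override replaces the base rate; unknown tiers fall back to it.
    return config.slice_rates.get(tenant_tier, config.base_rate)


def should_judge(trace_id: str, tenant_tier: str, config: SamplingConfig) -> bool:
    """Decide whether a logged trace is sent to the LLM judge.

    Assumption: the trace ID is hashed into a 64-bit bucket instead of
    calling a random number generator, so the decision is reproducible.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate_for(tenant_tier, config) * 2**64
```

Every trace is still logged; should_judge only gates the judge call. Changing the rates in the config changes the judged share of traffic without touching the logging path.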

When to use

  • Production traffic is large enough that judging every trace is infeasible.
  • Drift detection on real traffic matters.
  • Some slices justify weighted sampling.
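
To make drift detection on sampled judge scores concrete, the sketch below groups scores into fixed time windows and flags windows whose mean falls below a baseline. It is an illustration under assumptions: the function names, the one-hour window, the fixed baseline, and the tolerance of 0.1 are all hypothetical choices, and a real dashboard would likely also track sample counts per window.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean


def window_means(scored, window_seconds=3600):
    """Average judge scores per fixed time window.

    scored: iterable of (timestamp, score) pairs, where timestamp is a
    timezone-aware datetime. Returns {window_start_epoch: mean_score},
    ordered by window start.
    """
    buckets = defaultdict(list)
    for ts, score in scored:
        start = int(ts.timestamp()) // window_seconds * window_seconds
        buckets[start].append(score)
    return {start: mean(vals) for start, vals in sorted(buckets.items())}


def drifted_windows(means, baseline, tolerance=0.1):
    """Return window starts whose mean is more than `tolerance` below baseline."""
    return [start for start, m in means.items() if baseline - m > tolerance]
```

Because only a sample is judged, windows with few judged traces give noisy means, so the window length and the sample rate have to be chosen together.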
