Eval Harness

also known as Golden Dataset Suite, Champion-Challenger, Regression Suite

Run a held-out dataset against agent versions to detect regressions and measure improvement.

This pattern helps complete certain larger patterns —

used-by DSPy Signatures ★ — Specify agent behaviour as declarative typed signatures and modules; compile prompts and few-shot examples automatically against a metric.
used-by Agent-as-a-Judge ★ — Evaluate an agent's full trajectory (steps, tool calls, intermediate states) by another agent rather than scoring only the final output.
used-by Automatic Workflow Search · — Treat the agent's workflow (a graph of LLM-invoking nodes) as an artefact to search; use Monte Carlo Tree Search guided by an eval benchmark to discover the best workflow, then deploy it.
used-by Bayesian Bandit Experimentation ★ — Replace fixed-split A/B tests between agent variants with a bandit that dynamically reallocates traffic toward better-performing variants based on observed reward, bounding regret from bad variants.
used-by Evaluation-Driven Development
used-by Dimensional Synthetic Eval Set ★ — Generate evaluation inputs not by free-form LLM prompting (which mode-collapses) but by enumerating tuples over explicitly named dimensions and seeding generation from each tuple.
used-by Prompt Variant Evaluation ★★ — Author multiple variants of the same prompt node, run them as a batch against a shared dataset, and let an automated evaluation flow score them so the winning variant is selected by measurement.

Context

A team is iterating on an agent whose outputs depend on a prompt, a model version, retrieval choices, and tool wiring — none of which is deterministic in the way a normal function is. Small changes anywhere in that stack can shift behaviour in ways that are not obvious from a few hand-tested examples. The team needs a way to compare a proposed version against the current one on a fixed, representative set of inputs.

Problem

When the team relies on intuition or a handful of spot checks, a change that 'feels better' on three examples can quietly regress on the dozens of cases nobody re-ran. Open-ended outputs cannot be checked with simple exact-match assertions, so without a deliberate scoring approach there is no shared yardstick. The team is left with two poor options: ship by feel and learn about regressions from user complaints, or run ad-hoc one-off comparisons that never accumulate into a baseline.

Forces

Dataset construction is expensive, and the dataset goes stale over time.
Judging open-ended outputs needs a metric or a judge.
Champion-challenger comparison is fairer, but running both versions doubles the cost.

Example

A team intuits that switching from one model to another 'feels better' for their RAG agent and pushes the change. Two days later, users complain that summaries are now missing key facts. They build an Eval Harness: a held-out dataset of representative queries, a scoring function for each, and a runner that scores any candidate version. Now changes that 'feel better' get a number; the regression on factual recall would have been visible before deploy.
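
The scoring function in this example could be as simple as a key-fact recall check. The following is a minimal sketch, not the team's actual scorer: the `fact_recall` name, the substring-matching rule, and the sample case are all assumptions made for illustration. A real harness would more likely use a rubric or an LLM judge for paraphrased facts.

```python
def fact_recall(summary: str, key_facts: list[str]) -> float:
    """Fraction of expected key facts that appear in the summary.

    Assumption: a fact counts as present if it occurs as a
    case-insensitive substring. Paraphrases are not detected.
    """
    if not key_facts:
        return 1.0  # nothing to recall, so nothing can be missing
    text = summary.lower()
    found = sum(1 for fact in key_facts if fact.lower() in text)
    return found / len(key_facts)


# Hypothetical golden case for a RAG summarisation agent.
case = {
    "query": "Summarise the Q3 incident report.",
    "key_facts": ["database failover", "47 minutes", "no data loss"],
}

old_summary = "A database failover caused 47 minutes of downtime with no data loss."
new_summary = "There was a brief outage caused by a database failover."

old_score = fact_recall(old_summary, case["key_facts"])  # all three facts present
new_score = fact_recall(new_summary, case["key_facts"])  # only one fact present
print(f"old model: {old_score:.2f}, new model: {new_score:.2f}")
```

With a score like this attached to every golden case, the drop in factual recall shows up as a number before the change is deployed.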

Diagram

flowchart TD
    G[Golden dataset] --> Champ[Run champion]
    G --> Chal[Run challenger]
    Champ --> Sc1[Score]
    Chal --> Sc2[Score]
    Sc1 --> Cmp{Lift vs regression?}
    Sc2 --> Cmp
    Cmp -- lift --> Promote[Promote challenger]
    Cmp -- regression --> Block[Block]

Solution

Therefore:

Build a golden dataset of (input, expected output) pairs. Run each candidate version against the dataset and score every output. Compare the champion (current version) against the challenger (proposed version). Promote the challenger on a quality lift; block it on a regression. Re-run on every meaningful change.
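
The steps above can be sketched as a small runner. This is an illustrative outline under stated assumptions, not a reference implementation: `GoldenCase`, `run_suite`, `compare`, and the toy `exact_match` scorer are hypothetical names, agents are modelled as plain functions from text to text, and "lift" is taken to mean a higher mean score.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

Agent = Callable[[str], str]          # input text -> output text
Scorer = Callable[[str, str], float]  # (output, expected) -> score in [0, 1]


@dataclass(frozen=True)
class GoldenCase:
    input_text: str
    expected: str


def run_suite(agent: Agent, dataset: list[GoldenCase], scorer: Scorer) -> list[float]:
    """Run one agent version over every golden case and score each output."""
    return [scorer(agent(case.input_text), case.expected) for case in dataset]


def compare(champion: Agent, challenger: Agent,
            dataset: list[GoldenCase], scorer: Scorer) -> dict:
    """Score both versions on the same cases and report the mean-score lift."""
    champ = mean(run_suite(champion, dataset, scorer))
    chal = mean(run_suite(challenger, dataset, scorer))
    lift = chal - champ
    return {"champion": champ, "challenger": chal,
            "lift": lift, "promote": lift > 0}


# Toy stand-ins so the sketch runs end to end (assumptions, not real agents).
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


dataset = [GoldenCase("2+2", "4"), GoldenCase("capital of France", "Paris")]
champion = lambda q: {"2+2": "4"}.get(q, "unknown")
challenger = lambda q: {"2+2": "4", "capital of France": "paris"}.get(q, "unknown")

report = compare(champion, challenger, dataset, exact_match)
print(report)
```

For open-ended outputs the `exact_match` scorer would be swapped for a metric or a judge; the runner itself does not change, which is why the scorer is passed in as a parameter.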

What this pattern forbids. Shipping a release while the harness flags a regression beyond the agreed tolerance.
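
One way to make "beyond tolerance" concrete is a gate over the two score lists. The sketch below is an assumption about how such a gate might look: the `gate` function, the default tolerance of 0.02, and the choice to gate on the mean while only reporting per-case drops are all illustrative, not prescribed by the pattern.

```python
from statistics import mean


def gate(champion_scores: list[float], challenger_scores: list[float],
         tolerance: float = 0.02) -> tuple[bool, list[int]]:
    """Return (passed, regressed_case_indices).

    The gate fails when the challenger's mean score falls more than
    `tolerance` below the champion's. Cases whose individual score
    dropped by more than `tolerance` are listed for triage, but in
    this sketch they do not fail the gate on their own.
    """
    if not champion_scores or len(champion_scores) != len(challenger_scores):
        raise ValueError("both versions must be scored on the same non-empty case list")
    drop = mean(champion_scores) - mean(challenger_scores)
    regressed = [
        i for i, (champ, chal) in enumerate(zip(champion_scores, challenger_scores))
        if champ - chal > tolerance
    ]
    return drop <= tolerance, regressed


# Hypothetical scores for three golden cases.
passed, regressed = gate([0.9, 0.8, 1.0], [0.9, 0.5, 1.0])
print("release allowed" if passed else f"release blocked, regressed cases: {regressed}")
```

In a CI pipeline the boolean would typically be turned into a non-zero exit status so that the release step cannot run when the gate fails.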

The smaller patterns that complete this one —

uses LLM-as-Judge ★★ — Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
generalises Eval as Contract ★★ — Treat the eval suite as the contract the agent must satisfy; releases ship only if evals pass.

And the patterns that stand alongside it, or against it —

complements Shadow Canary ★★ — Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.
alternative-to Perma-Beta ✕ — Anti-pattern: ship the agent in 'beta' indefinitely so that quality regressions are someone else's problem.
complements Scorer Live Monitoring ★ — Score agent outputs asynchronously in production with non-blocking scorers that observe, alert, and log but do not regenerate the output.
complements Dual Evaluation (Offline + Online) ★ — Run two parallel evaluation tracks — offline benchmark gates before deploy AND online production-traffic monitoring after — so drift is caught even when pre-deploy benchmarks pass.
complements Red-Team Sandbox Reproduction ★ — Routinely re-reproduce canonical alignment-failure modes inside a sealed sandbox per release; treat the alignment regression suite as a deployment gate.
complements Intermediate Artifact Evaluation ★ — Evaluate intermediate artifacts (plans, tool-call traces, guardrail reactions) not only final outputs; isolates failure to a specific pipeline node.
complements Agent Evaluator ★ — A dedicated agent or harness whose sole job is running tests against another agent's outputs to evaluate performance; distinct from eval-harness (offline batch) and llm-as-judge (per-output).
complements Sampled Prompt Trace Eval ★ — Capture full prompt/response/metadata traces from production into a monitoring dataset, but only run LLM-judge evaluation on a random sample so monitoring cost stays bounded as traffic grows.
alternative-to Behavior-Pinning Test Before Agent Edit ★ — Capture the current behaviour of agent-touchable code as golden characterization tests before an agent edits it, with load-bearing values computed deterministically and only prose left to the model, run as a regression gate.
