Eval Harness
Run agent versions against a held-out dataset to detect regressions and measure improvement.
Problem
When the team relies on intuition or a handful of spot checks, a change that 'feels better' on three examples can quietly regress on the dozens of cases nobody re-ran. Open-ended outputs cannot be checked with simple exact-match assertions, so without a deliberate scoring approach there is no shared yardstick. The team is left with two poor options: ship by feel and learn about regressions from user complaints, or run ad-hoc one-off comparisons that never accumulate into a baseline.
Solution
Build a golden dataset of (input, expected output) pairs. Run each candidate version against the dataset and score every output. Compare the champion (current version) against the challenger (proposed version). Promote the challenger on a quality lift; block it on a regression. Re-run on every meaningful change.
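The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `Case`, `token_overlap`, `run_eval` and `compare` are hypothetical, the token-overlap scorer is only a stand-in for whatever scoring approach the team adopts (a rubric, an LLM judge, a task-specific metric), and the `min_lift` and `max_case_drop` thresholds are assumed values chosen for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Hypothetical type aliases: an agent maps an input string to an output string,
# a scorer maps (output, expected) to a score in [0, 1].
Agent = Callable[[str], str]
Scorer = Callable[[str, str], float]


@dataclass(frozen=True)
class Case:
    """One golden-dataset entry: an input and its expected output."""
    input: str
    expected: str


def token_overlap(output: str, expected: str) -> float:
    """Stand-in scorer: fraction of expected tokens that appear in the output."""
    want = set(expected.lower().split())
    got = set(output.lower().split())
    return len(want & got) / len(want) if want else 1.0


def run_eval(agent: Agent, dataset: list[Case], scorer: Scorer) -> list[float]:
    """Run one agent version over every case and return per-case scores."""
    return [scorer(agent(case.input), case.expected) for case in dataset]


def compare(champion: list[float], challenger: list[float],
            min_lift: float = 0.02, max_case_drop: float = 0.2) -> dict:
    """Decide promotion: block on any per-case regression, promote on mean lift.

    Both thresholds are assumptions made for this sketch.
    """
    lift = mean(challenger) - mean(champion)
    regressed = [i for i, (old, new) in enumerate(zip(champion, challenger))
                 if old - new > max_case_drop]
    if regressed:
        decision = "block"
    elif lift >= min_lift:
        decision = "promote"
    else:
        decision = "hold"
    return {"lift": lift, "regressed_cases": regressed, "decision": decision}


# Toy golden dataset and two toy agent versions (plain lookups, no model calls).
golden = [
    Case("capital of France?", "Paris"),
    Case("2 + 2", "4"),
    Case("opposite of hot", "cold"),
]
champion_agent = lambda q: {"capital of France?": "Paris", "2 + 2": "5"}.get(q, "")
challenger_agent = lambda q: {"capital of France?": "It is Paris", "2 + 2": "4",
                              "opposite of hot": "cold"}.get(q, "")

result = compare(run_eval(champion_agent, golden, token_overlap),
                 run_eval(challenger_agent, golden, token_overlap))
print(result["decision"], result["regressed_cases"])
```

Checking per-case drops as well as the mean is a deliberate choice in this sketch: an average can rise while individual cases get worse, which is the silent regression the pattern is meant to catch.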
When to use
- Changes that 'feel better' on a few examples may be silently regressing quality elsewhere in your system.
- A golden dataset of (input, expected output) pairs can be constructed.
- Champion-vs-challenger comparison drives promotion decisions.
Related
- LLM-as-Judge
- Eval as Contract
- Shadow Canary
- Perma-Beta
- DSPy Signatures
- Agent-as-a-Judge
- Automatic Workflow Search
- Scorer Live Monitoring
- Dual Evaluation (Offline + Online)
- Red-Team Sandbox Reproduction
- Intermediate Artifact Evaluation
- Agent Evaluator
- Bayesian Bandit Experimentation
- Sampled Prompt Trace Eval
- Dimensional Synthetic Eval Set
- Prompt Variant Evaluation