Governance & Observability

Eval Harness

Run a held-out dataset against agent versions to detect regressions and measure improvement.

Problem

When the team relies on intuition or a handful of spot checks, a change that 'feels better' on three examples can quietly regress on the dozens of cases nobody re-ran. Open-ended outputs cannot be checked with simple exact-match assertions, so without a deliberate scoring approach there is no shared yardstick. The team is forced to choose between shipping by feel and reading user complaints, or running ad-hoc one-off comparisons that never accumulate into a baseline.

Solution

Build a golden dataset of (input, expected output) pairs. Run candidate versions against the dataset; score each. Compare champion (current) against challenger (proposed). Promote on quality lift, blocked on regression. Re-run on every meaningful change.

When to use

  • A change that 'feels better' is regressing quality silently in your system.
  • A golden dataset of (input, expected output) pairs can be constructed.
  • Champion-vs-challenger comparison drives promotion decisions.

Open the full interactive page

Diagram, neighbourhood map, code examples, related patterns and full provenance.

Related