Eval as Contract
also known as Test-Driven Agent, Eval-Gated Release
Treat the eval suite as the contract the agent must satisfy; releases ship only if evals pass.
This pattern helps complete certain larger patterns —
- specialisesEval Harness★★— Run a held-out dataset against agent versions to detect regressions and measure improvement.
- used-byPrompt Versioning★★— Treat prompts as immutable, hashed, semver'd artefacts in a registry; deploy and roll back like code.
- used-byRigor Relocation★— Relocate verification rigor from the model loop to surrounding scaffolding (evals, judges, decision logs, policy gates) so failures are caught by the wrapper rather than the agent.
Context
A team ships an agent to real users and is expected to keep a stable quality bar release after release. They have an evaluation suite — a held-out set of inputs paired with expected outputs or rubric checks — that already gives them a numeric read on quality. Stakeholders such as product, customers, and compliance depend on that bar holding from one release to the next.
Problem
If the eval suite is something the team runs by hand and looks at when they remember to, regressions slip through silently: a prompt tweak goes out on Tuesday, the eval suite is not run, and by Thursday quality has dropped without anyone noticing. The suite turns into aspirational documentation rather than an actual constraint on releases. The team is forced to choose between trusting vibes between deploys or treating the eval suite the way they would treat a failing unit test.
Forces
- Contract authoring is up-front work.
- Eval-suite drift if not maintained.
- Calibration: which evals are blocking, which are advisory.
Example
A team improves their support agent's planning prompt and ships the change on a Tuesday. By Thursday, the agent's tool-selection accuracy on three known regressions has dropped, but no one notices because there's no gate. They adopt Eval-as-Contract: the held-out eval suite is treated as the release contract — every PR runs it, and any regression below threshold blocks the deploy. The eval suite stops being optional documentation and starts being the thing the agent has to satisfy.
Diagram
Solution
Therefore:
Define a tiered eval suite: blocking evals (must pass for release), advisory evals (tracked but not blocking). Wire blocking evals into CI. Block PRs and releases when blocking evals fail. Treat eval changes as architectural changes (review, signoff).
What this pattern forbids. Releases are forbidden when blocking evals fail; bypassing requires explicit operator override.
And the patterns that stand alongside it, or against it —
- complementsShadow Canary★★— Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.
- conflicts-withPerma-Beta✕— Anti-pattern: ship the agent in 'beta' indefinitely so that quality regressions are someone else's problem.
- complementsAutomatic Workflow Search·— Treat the agent's workflow (a graph of LLM-invoking nodes) as an artefact to search; use Monte Carlo Tree Search guided by an eval benchmark to discover the best workflow, then deploy it.
- alternative-toDemo-to-Production Cliff✕— Anti-pattern: ship a demo-validated agent straight into production without a frozen eval, cost ceiling, loop-detector, or named oncall, then act surprised when accuracy drops and cost runs away.
- alternative-toAgentic Skill Atrophy✕— Anti-pattern: let agents take over routine architectural and debugging decisions in code until developers no longer form the implicit knowledge that lets them review the agent's output or recover when it fails.
- alternative-toAgentic Debt✕— Anti-pattern: deploy agents on top of an unconsolidated data foundation, weak governance, or missing MLOps infrastructure, so every subsequent capability — observability, retraining, compliance retrofit — pays compounding interest on the skipped foundational work.
- complementsOwn Your Prompts (12-Factor Agents)★— Every prompt in a production agent is versioned, tested, and owned by the team in the application repo — never inherited as a framework default.
- complementsStochastic-Deterministic Boundary (SDB)★— Formalize the seam between an LLM proposal and a system action as a four-part contract — proposer, verifier, commit step, reject signal — so the contract itself, not the agent's good intent, gates side-effects.
- complementsDemo-Production Cliff (Multi-Agent)✕— Anti-pattern: multi-agent pilot benchmarks at 95% accuracy / 2s latency on a curated demo set, then degrades to ~80% / 40s under realistic 10k-RPD load.
- complementsRed-Team Sandbox Reproduction★— Routinely re-reproduce canonical alignment-failure modes inside a sealed sandbox per release; treat the alignment regression suite as a deployment gate.
- complementsIntermediate Artifact Evaluation★— Evaluate intermediate artifacts (plans, tool-call traces, guardrail reactions) not only final outputs; isolates failure to a specific pipeline node.
Neighbourhood
Click any neighbour to follow the language. Scroll to zoom, drag to pan.