Eval as Contract

also known as Test-Driven Agent, Eval-Gated Release

Treat the eval suite as the contract the agent must satisfy; releases ship only if evals pass.

This pattern helps complete certain larger patterns —

specialisesEval Harness★★— Run a held-out dataset against agent versions to detect regressions and measure improvement.
used-byPrompt Versioning★★— Treat prompts as immutable, hashed, semver'd artefacts in a registry; deploy and roll back like code.
used-byRigor Relocation★— Relocate verification rigor from the model loop to surrounding scaffolding (evals, judges, decision logs, policy gates) so failures are caught by the wrapper rather than the agent.
used-byScaffold Ablation on Model Upgrade★— On each model upgrade, treat every harness component as an encoded assumption about a model weakness and ablate the components the new model no longer needs, gated by evals.

Context

A team ships an agent to real users and is expected to keep a stable quality bar release after release. They have an evaluation suite — a held-out set of inputs paired with expected outputs or rubric checks — that already gives them a numeric read on quality. Stakeholders such as product, customers, and compliance depend on that bar holding from one release to the next.

Problem

If the eval suite is something the team runs by hand and looks at when they remember to, regressions slip through silently: a prompt tweak goes out on Tuesday, the eval suite is not run, and by Thursday quality has dropped without anyone noticing. The suite turns into aspirational documentation rather than an actual constraint on releases. The team is forced to choose between trusting vibes between deploys or treating the eval suite the way they would treat a failing unit test.

Forces

Contract authoring is up-front work.
Eval-suite drift if not maintained.
Calibration: which evals are blocking, which are advisory.

Example

A team improves their support agent's planning prompt and ships the change on a Tuesday. By Thursday, the agent's tool-selection accuracy on three known regressions has dropped, but no one notices because there's no gate. They adopt Eval-as-Contract: the held-out eval suite is treated as the release contract — every PR runs it, and any regression below threshold blocks the deploy. The eval suite stops being optional documentation and starts being the thing the agent has to satisfy.

Diagram

flowchart TD PR[Pull request] --> CI[Run tiered eval suite] CI --> B{Blocking evals pass?} B -- no --> X[Block release] B -- yes --> Adv{Advisory evals} Adv --> Track[Track regressions] Adv --> R[Release proceeds] EvChange[Eval change] --> Review[Architectural review + signoff]

Solution

Therefore:

Define a tiered eval suite: blocking evals (must pass for release), advisory evals (tracked but not blocking). Wire blocking evals into CI. Block PRs and releases when blocking evals fail. Treat eval changes as architectural changes (review, signoff).

What this pattern forbids. Releases are forbidden when blocking evals fail; bypassing requires explicit operator override.

The smaller patterns that complete this one —

generalisesRe-Contact-Subtracted Resolution Gate★— Gate a support agent on a re-contact-subtracted resolution rate so an interaction that merely ends the session is never reported as a resolved one.

And the patterns that stand alongside it, or against it —

complementsShadow Canary★★— Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.
conflicts-withPerma-Beta✕— Anti-pattern: ship the agent in 'beta' indefinitely so that quality regressions are someone else's problem.
complementsAutomatic Workflow Search·— Treat the agent's workflow (a graph of LLM-invoking nodes) as an artefact to search; use Monte Carlo Tree Search guided by an eval benchmark to discover the best workflow, then deploy it.
alternative-toDemo-to-Production Cliff✕— Anti-pattern: ship a demo-validated agent straight into production without a frozen eval, cost ceiling, loop-detector, or named oncall, then act surprised when accuracy drops and cost runs away.
alternative-toAgentic Skill Atrophy✕— Anti-pattern: let agents take over routine architectural and debugging decisions in code until developers no longer form the implicit knowledge that lets them review the agent's output or recover when it fails.
alternative-toAgentic Debt✕— Anti-pattern: deploy agents on top of an unconsolidated data foundation, weak governance, or missing MLOps infrastructure, so every subsequent capability — observability, retraining, compliance retrofit — pays compounding interest on the skipped foundational work.
complementsOwn Your Prompts (12-Factor Agents)★— Every prompt in a production agent is versioned, tested, and owned by the team in the application repo — never inherited as a framework default.
complementsStochastic-Deterministic Boundary (SDB)★— Formalize the seam between an LLM proposal and a system action as a four-part contract — proposer, verifier, commit step, reject signal — so the contract itself, not the agent's good intent, gates side-effects.
complementsDemo-Production Cliff (Multi-Agent)✕— Anti-pattern: multi-agent pilot benchmarks at 95% accuracy / 2s latency on a curated demo set, then degrades to ~80% / 40s under realistic 10k-RPD load.
complementsRed-Team Sandbox Reproduction★— Routinely re-reproduce canonical alignment-failure modes inside a sealed sandbox per release; treat the alignment regression suite as a deployment gate.
complementsIntermediate Artifact Evaluation★— Evaluate intermediate artifacts (plans, tool-call traces, guardrail reactions) not only final outputs; isolates failure to a specific pipeline node.
alternative-toCompliance-Certified Launch Gate★— Require an external regulator to certify the generative service against a published content-safety standard before it may serve the public, forcing the standard's controls into the build as a re-certifiable artifact.
complementsSilent Hypotheses in Generated Code✕— Anti-pattern: model-written code rests on an unstated runtime premise that passing tests and code review never surface, so the hidden assumption travels into production and fails there.
complementsBehavior-Pinning Test Before Agent Edit★— Capture the current behaviour of agent-touchable code as golden characterization tests before an agent edits it, with load-bearing values computed deterministically and only prose left to the model, run as a regression gate.

Neighbourhood

Click any neighbour to follow the language. Scroll to zoom, drag to pan.

Used in recipes

Eval & Observability
core

Used in frameworks

References

ai-standards/ai-design-patterns (Eval as Contract)
repo

Provenance

Source: patterns/eval-as-contract.md on GitHub · commit 4fa1213 · view history
Added to catalog: 2026-04-30
Last updated: 2026-05-22
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.