Eval as Contract
Treat the eval suite as the contract the agent must satisfy; releases ship only if evals pass.
Problem
If the eval suite is run by hand, and only when someone remembers to, regressions slip through silently: a prompt tweak ships on Tuesday, the suite is not run, and by Thursday quality has dropped without anyone noticing. The suite becomes aspirational documentation rather than an actual constraint on releases. The team has to choose between trusting vibes between deploys and treating a failing eval the way they would treat a failing unit test.
Solution
Define a tiered eval suite: blocking evals, which must pass before a release, and advisory evals, which are tracked but do not block. Wire the blocking evals into CI so that a failure blocks both PRs and releases. Treat changes to the evals themselves as architectural changes that require review and signoff.
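The tiering and the release gate can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the names Tier, EvalResult, release_gate and exit_code, and the idea of a per-eval score threshold, are assumptions made for the example and do not come from the pattern itself.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    BLOCKING = "blocking"  # must pass for a release to ship
    ADVISORY = "advisory"  # tracked and reported, never blocks


@dataclass(frozen=True)
class EvalResult:
    # All fields are hypothetical; a real harness may report pass/fail directly.
    name: str
    tier: Tier
    score: float
    threshold: float  # assumed per-eval pass bar

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def release_gate(results):
    """Return (ok, blocking_failures, advisory_failures).

    ok is False as soon as any blocking eval fails; advisory failures
    are collected for reporting but never change ok.
    """
    blocking = [r.name for r in results
                if r.tier is Tier.BLOCKING and not r.passed]
    advisory = [r.name for r in results
                if r.tier is Tier.ADVISORY and not r.passed]
    return (not blocking, blocking, advisory)


def exit_code(results) -> int:
    """Map the gate to a process exit code: non-zero fails the CI job."""
    ok, _, _ = release_gate(results)
    return 0 if ok else 1
```

In a CI job, the script running the evals would end with something like sys.exit(exit_code(results)), so that a failing blocking eval fails the job while a failing advisory eval only shows up in the report.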
When to use
- An eval suite exists that can be tiered into blocking and advisory.
- CI can be wired so blocking eval failures actually prevent release.
- The team is willing to treat eval changes as architectural changes (review and signoff).
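The third condition, treating eval changes as architectural changes, can also be checked mechanically. The sketch below assumes eval definitions live under a top-level evals/ directory and that a fixed set of owners must approve changes to them; EVAL_DIR, EVAL_OWNERS, REQUIRED_SIGNOFFS and both function names are hypothetical and only illustrate the idea.

```python
from pathlib import PurePosixPath

EVAL_DIR = "evals"                          # hypothetical location of eval definitions
EVAL_OWNERS = frozenset({"alice", "bob"})   # hypothetical reviewers who may sign off
REQUIRED_SIGNOFFS = 1                       # assumed minimum number of owner approvals


def touches_evals(changed_files):
    """True if any changed path sits under the top-level eval directory."""
    return any(PurePosixPath(p).parts[:1] == (EVAL_DIR,) for p in changed_files)


def signoff_ok(changed_files, approvers):
    """Changes outside evals/ pass freely; eval changes need owner signoff."""
    if not touches_evals(changed_files):
        return True
    return len(EVAL_OWNERS & set(approvers)) >= REQUIRED_SIGNOFFS
```

A check like this would run alongside the blocking evals, so that loosening a threshold or deleting an eval cannot quietly turn a failing release into a passing one.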
Related
- Eval Harness
- Shadow Canary
- Perma-Beta
- Prompt Versioning
- Automatic Workflow Search
- Demo-to-Production Cliff
- Agentic Skill Atrophy
- Agentic Debt
- Rigor Relocation
- Own Your Prompts (12-Factor Agents)
- Stochastic-Deterministic Boundary (SDB)
- Demo-Production Cliff (Multi-Agent)
- Red-Team Sandbox Reproduction
- Intermediate Artifact Evaluation