Governance & Observability

Eval as Contract

Treat the eval suite as the contract the agent must satisfy; releases ship only if evals pass.

Problem

When the eval suite is run by hand, and only when someone remembers, regressions slip through silently: a prompt tweak ships on Tuesday, nobody runs the suite, and by Thursday quality has dropped without anyone noticing. The suite becomes aspirational documentation rather than an actual constraint on releases. The team must choose between trusting vibes between deploys and treating a failing eval the way they would treat a failing unit test.

Solution

Define a tiered eval suite: blocking evals, which must pass for a release to ship, and advisory evals, which are tracked but do not block. Wire the blocking evals into CI so that PRs and releases are blocked whenever one of them fails. Treat changes to the evals themselves as architectural changes that require review and signoff.
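
As an illustration of the tiering described above, here is a minimal sketch of a release gate. It assumes each eval returns a score between 0 and 1 and passes when the score meets a threshold; the names `Tier`, `Eval`, `GateResult`, `run_gate` and `exit_code` are hypothetical and do not come from any particular eval framework.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Tier(Enum):
    BLOCKING = "blocking"   # must pass for release
    ADVISORY = "advisory"   # tracked but not blocking


@dataclass
class Eval:
    name: str
    tier: Tier
    run: Callable[[], float]  # assumption: returns a score in [0, 1]
    threshold: float          # assumption: the eval passes when score >= threshold


@dataclass
class GateResult:
    blocking_failures: List[str]
    advisory_failures: List[str]

    @property
    def release_allowed(self) -> bool:
        # Only blocking failures prevent a release.
        return not self.blocking_failures


def run_gate(evals: List[Eval]) -> GateResult:
    """Run every eval and sort the failures by tier."""
    result = GateResult(blocking_failures=[], advisory_failures=[])
    for e in evals:
        if e.run() >= e.threshold:
            continue
        if e.tier is Tier.BLOCKING:
            result.blocking_failures.append(e.name)
        else:
            result.advisory_failures.append(e.name)
    return result


def exit_code(result: GateResult) -> int:
    """Process exit code for a CI step: nonzero fails the job."""
    return 0 if result.release_allowed else 1
```

A CI job could call `run_gate`, report the advisory failures, and exit with `exit_code(result)`, so that a failing blocking eval fails the job in the same way a failing unit test would.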

When to use

An eval suite exists that can be tiered into blocking and advisory.
CI can be wired so blocking eval failures actually prevent release.
The team is willing to treat eval changes as architectural changes (review and signoff).
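
One possible way to make "eval changes need review and signoff" mechanically checkable is to fingerprint the eval definitions and compare the result against a fingerprint recorded at the last approved review. The sketch below assumes the eval definitions can be represented as a JSON-serializable dict; `fingerprint`, `needs_signoff` and the stored approved value are hypothetical names, not part of any existing tool.

```python
import hashlib
import json


def fingerprint(eval_config: dict) -> str:
    """Stable SHA-256 digest of the eval definitions, independent of key order."""
    canonical = json.dumps(eval_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_signoff(eval_config: dict, approved_fingerprint: str) -> bool:
    """True when the evals differ from the last reviewed and approved version."""
    return fingerprint(eval_config) != approved_fingerprint
```

A CI step could fail when `needs_signoff` returns true and the approved fingerprint has not been updated in the same reviewed change, which keeps thresholds from being loosened quietly.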
