Methodology · Evaluation · proven · verified

Evaluation-Driven Development

Also known as: eval-first development, EDD

Applies to: agent, rag-system, llm-app, coding-agent

Tags: eval-first, rubric, regression-gate

Build the test harness before you build the large language model (LLM) app. Write down a grading rubric and a set of test cases first, then save both with a version tag. From then on, those scores decide which model you pick, how you change the prompt, and whether any change ships. If you write the rubric after the prompt, the rubric just rewards what the prompt already does well. Freezing the test first stops that.

Methodology process overview

flowchart TD
    task[Task definition] --> s1[Articulate rubric]
    expert[Domain experts] --> s1
    s1 --> s2[Curate eval set]
    traffic[Real or synthetic inputs] --> s2
    s2 --> s3[Freeze and version]
    s3 --> rubric[(Versioned eval set + rubric)]
    rubric --> s4[Measure baseline]
    s4 --> floor[Baseline score floor]
    floor --> s5[Iterate against eval]
    change[Prompt / model / tool change] --> s5
    s5 --> ship{Beats floor?}
    ship -->|yes| s5b[Ship change, raise floor]
    ship -->|no| s5c[Revert]
    s5b --> s6[Refresh eval on drift]
    s5c --> s6
    prod[Production drift / new failure mode] --> s6
    s6 --> rubric

Intent. Judge every prompt change, model swap, search tweak, and new tool against a test you committed to up front, not by feel.

When to apply. Use this when you start any LLM app beyond a one-off demo. Examples: search over policy docs, an agent that fills CRM fields, a chatbot for one narrow workflow, or a grader for coding tasks. Reach for it before you write the first prompt. Don't apply it when you are still exploring an open research question with no product to ship, because you cannot know the rubric yet.

Example scenario

A two-person team at a contracts SaaS was building a clause extractor. It read uploaded NDAs and pulled out the governing-law clause, the term length, and the exclusivity scope. Before writing a single prompt, they spent an afternoon with the in-house lawyer. Together they wrote a checklist for a correct extraction: the exact text of the governing-law jurisdiction, the term as an ISO-formatted number of months, and a fixed choice for exclusivity (none/one-way/mutual).

They gathered 180 sample NDAs from public filings and their own anonymised set. Thirty of those were deliberately tricky: handwritten signatures embedded as images, mixed-jurisdiction clauses, and missing fields. They froze the test set as evals/v1/ in the repo. A fixed checker handled the structured fields. A model-judge prompt with a fixed rubric handled the free-text jurisdiction string.

They ran zero-shot Claude Haiku as the baseline. It scored 61% exact-field match and 78% judge-passing on jurisdiction. Every later change ran against the same test: a prompt edit, search over a case-law glossary, a swap to Sonnet, and a new citation-extraction tool. The Sonnet swap moved the scores to 84% / 91% and shipped. A 'helpful' rewording of the prompt dropped exclusivity accuracy from 88% to 71% and was reverted within the hour.

What they learned: the rubric written before the prompt held up. Late in the project the lawyer asked if they could 'just relax the exclusivity scoring'. They said no, because that would have invalidated three weeks of comparable scores. They did add a v2 block of 40 new edge cases when a Spanish-language NDA showed up in production traffic. They kept v1 visible in the dashboard so the historical baseline never moved.

Inputs

Task definition — A clear statement of what the system must do. Include the subject area and who the user is.
Real or synthetic input distribution — Sample user inputs that cover the parts of the task that matter: different formats, edge cases, and tricky inputs meant to break things.
Notion of 'good' — An expert in the field who can say, even roughly, what a correct or acceptable output looks like.

Outputs

Versioned eval set — A frozen set of inputs, and sometimes the matching correct outputs, checked into the repo with a version tag.
Rubric or checker — An automatic way to grade outputs. It can be a fixed checker, a model-judge prompt against a frozen rubric, an exact match to expected output, or a mix.
Metrics dashboard — A repeatable run that produces a score for any candidate prompt, model, or setup.
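
As a rough illustration of how the three outputs fit together, the sketch below pairs a frozen list of cases with a checker and a repeatable scoring run. All names here (EvalCase, run_eval, exact_match, toy_system, CASES_V1) are hypothetical, and the arithmetic cases are toy data standing in for a real eval set and a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    expected: str  # known-correct output, when one exists


def exact_match(output: str, expected: str) -> bool:
    """Simplest possible checker: compare after trimming whitespace."""
    return output.strip() == expected.strip()


def run_eval(
    system: Callable[[str], str],
    cases: list[EvalCase],
    checker: Callable[[str, str], bool],
) -> float:
    """Return the fraction of cases whose output the checker accepts."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(checker(system(c.input_text), c.expected) for c in cases)
    return passed / len(cases)


# Hypothetical toy data standing in for a frozen, versioned eval set.
CASES_V1 = [
    EvalCase("c1", "2 + 2", "4"),
    EvalCase("c2", "3 + 5", "8"),
    EvalCase("c3", "10 - 7", "3"),
    EvalCase("c4", "6 * 7", "42"),
]


def toy_system(prompt: str) -> str:
    # Stand-in for an LLM call. It only handles addition, so it fails two cases.
    left, op, right = prompt.split()
    return str(int(left) + int(right)) if op == "+" else "unknown"


if __name__ == "__main__":
    print(run_eval(toy_system, CASES_V1, exact_match))  # prints 0.5
```

Any candidate prompt, model, or setup can be passed in as the system argument, so the same run produces comparable scores across candidates.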

Steps (6)

Articulate 'good' as a checkable rubric
Before any prompt exists, write down what a correct answer looks like. Use a fixed checker for structured tasks. Use a model-judge against a frozen rubric for open-ended tasks. Compare against expected output when you have known-correct answers.
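
A fixed checker for a structured task might look like the sketch below, which borrows the term and exclusivity fields from the example scenario. The field names (term_months, exclusivity), the choice to store the term as a plain integer month count, and the use of None for a missing term are all assumptions made for illustration.

```python
from typing import Any, Optional

# Allowed values taken from the example scenario's rubric.
EXCLUSIVITY_VALUES = {"none", "one-way", "mutual"}
STRUCTURED_FIELDS = ("term_months", "exclusivity")  # hypothetical field names


def _valid_term(value: Any) -> bool:
    """A term is either missing (None) or a positive whole number of months."""
    if value is None:
        return True
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def check_structured(
    predicted: dict[str, Any], expected: dict[str, Optional[Any]]
) -> dict[str, bool]:
    """Grade each structured field independently with exact-match rules."""
    term = predicted.get("term_months")
    excl = predicted.get("exclusivity")
    return {
        "term_months": _valid_term(term) and term == expected["term_months"],
        "exclusivity": (
            isinstance(excl, str)
            and excl in EXCLUSIVITY_VALUES
            and excl == expected["exclusivity"]
        ),
    }


def field_accuracy(graded: list[dict[str, bool]]) -> dict[str, float]:
    """Per-field pass rate across all graded cases."""
    return {f: sum(g[f] for g in graded) / len(graded) for f in STRUCTURED_FIELDS}


if __name__ == "__main__":
    gold = {"term_months": 24, "exclusivity": "mutual"}
    graded = [
        check_structured({"term_months": 24, "exclusivity": "mutual"}, gold),
        # Wrong type and wrong casing both fail: the rubric is strict on purpose.
        check_structured({"term_months": "24", "exclusivity": "Mutual"}, gold),
    ]
    print(field_accuracy(graded))  # {'term_months': 0.5, 'exclusivity': 0.5}
```

Keeping the per-field results separate makes it possible to see which field a later change helped or hurt, instead of a single blended number.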
Curate the eval set
Collect 50 to 500 inputs from real user traffic, or make up inputs that cover the parts of the task that matter. Add tricky inputs and edge cases on purpose, not as an afterthought.
Freeze and version
Commit the eval set and rubric with a version tag. Treat any change to either like a code change. That means reviewed, dated, and explained.
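
One way to make a silent edit to the frozen set detectable is to record a content hash next to the version tag and check it before every run. The sketch below is an assumption about how that could be done, not part of the methodology itself: fingerprint and assert_frozen are hypothetical helpers, and each case is assumed to be a dict with an "id" key.

```python
import hashlib
import json


def fingerprint(cases: list[dict], rubric: str) -> str:
    """SHA-256 over a canonical JSON encoding of the cases plus the rubric text.

    Cases are sorted by id and keys are sorted, so reordering the file does
    not change the hash, but any edit to content does.
    """
    canonical = json.dumps(
        {"cases": sorted(cases, key=lambda c: c["id"]), "rubric": rubric},
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_frozen(cases: list[dict], rubric: str, recorded: str) -> None:
    """Raise if the eval set or rubric no longer matches the recorded hash."""
    actual = fingerprint(cases, rubric)
    if actual != recorded:
        raise RuntimeError(
            f"eval set or rubric changed: recorded {recorded[:12]}, found {actual[:12]}"
        )


if __name__ == "__main__":
    cases = [
        {"id": "nda-001", "input": "sample NDA text A"},
        {"id": "nda-002", "input": "sample NDA text B"},
    ]
    rubric = "jurisdiction: exact text; exclusivity: none/one-way/mutual"
    recorded = fingerprint(cases, rubric)  # commit this alongside the version tag
    assert_frozen(cases, rubric, recorded)  # passes while nothing has changed
    print(recorded[:12])
```

A mismatch does not mean the change is wrong, only that it has to go through review as a new version.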
Measure baseline
Run the cheapest setup that might work against the test. That could be zero-shot on a small model, or a basic search-plus-prompt. Record the score. This is the floor every later change has to beat.
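
Recording the floor can be as simple as a write-once file. The sketch below assumes a JSON file per eval version; record_baseline, load_floor, the file name, and the metric names are hypothetical, and the scores are the baseline figures from the example scenario.

```python
import json
import tempfile
from pathlib import Path


def record_baseline(
    path: Path, eval_version: str, setup: str, scores: dict[str, float]
) -> dict:
    """Write the baseline once.

    Mode "x" raises FileExistsError if the file is already there, so the
    floor cannot be overwritten silently.
    """
    record = {"eval_version": eval_version, "setup": setup, "scores": scores}
    with path.open("x", encoding="utf-8") as fh:
        json.dump(record, fh, indent=2, sort_keys=True)
    return record


def load_floor(path: Path) -> dict[str, float]:
    """Read back the scores that every later change has to beat."""
    return json.loads(path.read_text(encoding="utf-8"))["scores"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        baseline_path = Path(tmp) / "baseline_v1.json"  # hypothetical file name
        record_baseline(
            baseline_path,
            eval_version="v1",
            setup="zero-shot small model",
            scores={"exact_field_match": 0.61, "jurisdiction_judge_pass": 0.78},
        )
        print(load_floor(baseline_path))
```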
Iterate with eval as the only ground truth
Run the test for every prompt edit, model swap, search change, or new tool. Ship the changes that improve the score. Revert the ones that drop it. Resist the urge to bless a change because a side demo 'feels better'.
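
The ship-or-revert decision can be expressed as a small gate function. The sketch below is one possible reading of the rule, stated as an assumption: a change ships only if no metric falls below the floor and at least one metric improves, and shipping raises the floor. GateDecision, gate, and the tolerance parameter are hypothetical names; the numbers come from the example scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    ship: bool
    new_floor: dict[str, float]
    regressions: dict[str, float]  # metric -> score change (negative values)


def gate(
    floor: dict[str, float], candidate: dict[str, float], tolerance: float = 0.0
) -> GateDecision:
    """Compare a candidate's scores with the current floor, metric by metric."""
    missing = floor.keys() - candidate.keys()
    if missing:
        raise ValueError(f"candidate is missing metrics: {sorted(missing)}")
    deltas = {m: candidate[m] - floor[m] for m in floor}
    regressions = {m: d for m, d in deltas.items() if d < -tolerance}
    improved = any(d > tolerance for d in deltas.values())
    if regressions or not improved:
        # Revert: the floor stays exactly where it was.
        return GateDecision(False, dict(floor), regressions)
    # Ship: the candidate's scores become the floor for the next change.
    return GateDecision(True, {m: max(floor[m], candidate[m]) for m in floor}, {})


if __name__ == "__main__":
    floor = {"exact_field_match": 0.61, "jurisdiction_judge_pass": 0.78}
    swap = gate(floor, {"exact_field_match": 0.84, "jurisdiction_judge_pass": 0.91})
    print(swap.ship, swap.new_floor)

    reword = gate({"exclusivity": 0.88}, {"exclusivity": 0.71})
    print(reword.ship, sorted(reword.regressions))
```

The default tolerance of zero is strict. A team with a noisy model-judge might choose a small positive tolerance, but that choice belongs in the versioned rubric, not in an ad hoc override.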
Refresh the eval on shifts
When production traffic changes or a new failure shows up, add cases in a new versioned block. Never quietly overwrite the old baseline. Keep the change visible so you can still compare against history.
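
Keeping refreshed cases in their own block means the old block can still be scored on its own. The sketch below is a minimal illustration under that assumption: add_block and score_by_block are hypothetical helpers, the block sizes (180 and 40) follow the example scenario, and the pass counts are made up for the demo.

```python
def add_block(
    eval_set: dict[str, list[str]], version: str, case_ids: list[str]
) -> dict[str, list[str]]:
    """Return a new eval set with an extra block; never overwrite an old one."""
    if version in eval_set:
        raise ValueError(f"block {version!r} already exists; pick a new version")
    return {**eval_set, version: list(case_ids)}


def score_by_block(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results holds (block_version, passed) pairs; returns pass rate per block."""
    totals: dict[str, int] = {}
    passes: dict[str, int] = {}
    for block, passed in results:
        totals[block] = totals.get(block, 0) + 1
        passes[block] = passes.get(block, 0) + int(passed)
    return {block: passes[block] / totals[block] for block in totals}


if __name__ == "__main__":
    eval_set = {"v1": [f"nda-{i:03d}" for i in range(180)]}
    eval_set = add_block(eval_set, "v2", [f"edge-{i:03d}" for i in range(40)])

    # Hypothetical pass counts, for illustration only.
    results = [("v1", i < 153) for i in range(180)] + [("v2", i < 30) for i in range(40)]
    print(score_by_block(results))  # {'v1': 0.85, 'v2': 0.75}
```

Reporting v1 and v2 separately keeps the v1 number comparable with every earlier run, while v2 shows how the system handles the newly observed cases.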

Principles

The test is the spec. If it does not measure something, that something does not ship.
Freeze before you build. A rubric written after the first prompt is shaped by that prompt.
Cover the parts that matter, not just the happy path. Tricky inputs and edge cases count as real inputs.
Every change runs the test, every time. No exceptions for 'small' tweaks.
