Evaluation-Driven Development
also known as eval-first development, EDD
Build the test harness before you build the large language model (LLM) app. Write down a grading rubric and a set of test cases first, then save both with a version tag. From then on, those scores decide which model you pick, how you change the prompt, and whether any change ships. If you write the rubric after the prompt, the rubric just rewards what the prompt already does well. Freezing the test first stops that.
Methodology process overview
Intent. Judge every prompt change, model swap, search tweak, and new tool against a test you committed to up front, not by feel.
When to apply. Use this when you start any LLM app beyond a one-off demo. Examples: search over policy docs, an agent that fills CRM fields, a chatbot for one narrow workflow, or a grader for coding tasks. Reach for it before you write the first prompt. Don't apply it when you are still exploring an open research question with no product to ship, because you cannot know the rubric yet.
Inputs
- Task definition — A clear statement of what the system must do. Include the subject area and who the user is.
- Real or synthetic input distribution — Sample user inputs that cover the parts of the task that matter: different formats, edge cases, and tricky inputs meant to break things.
- Notion of 'good' — An expert in the field who can say, even roughly, what a correct or acceptable output looks like.
Outputs
- Versioned eval set — A frozen set of inputs, and sometimes the matching correct outputs, checked into the repo with a version tag.
- Rubric or checker — An automatic way to grade outputs. It can be a fixed checker, a model-judge prompt against a frozen rubric, an exact match to expected output, or a mix.
- Metrics dashboard — A repeatable run that produces a score for any candidate prompt, model, or setup.
Steps (6)
Articulate 'good' as a checkable rubric
Before any prompt exists, write down what a correct answer looks like in a form a program can check. For structured tasks, use a fixed (deterministic) checker, such as a schema or field validator. For open-ended tasks, use a model judge scored against a frozen rubric. When known-correct answers exist, compare outputs against them.
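As a rough illustration, a rubric for a structured task can be written as a list of named, checkable criteria. This is a minimal sketch, assuming a CRM-style task whose output is a JSON object; the `Criterion` type, `RUBRIC_V1`, and the field names `company` and `contact_email` are hypothetical and not part of the methodology itself.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Criterion:
    """One checkable line of the rubric (hypothetical structure)."""
    name: str
    check: Callable[[str], bool]


def _parse_object(output: str):
    """Return the parsed JSON object, or None if output is not a JSON object."""
    try:
        data = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return None
    return data if isinstance(data, dict) else None


def is_json_object(output: str) -> bool:
    return _parse_object(output) is not None


def has_fields(required: Tuple[str, ...]) -> Callable[[str], bool]:
    """Build a check that passes only when every required key is present."""
    def check(output: str) -> bool:
        data = _parse_object(output)
        return data is not None and all(key in data for key in required)
    return check


# Assumed rubric for illustration; written before any prompt exists.
RUBRIC_V1 = (
    Criterion("valid_json_object", is_json_object),
    Criterion("has_required_fields", has_fields(("company", "contact_email"))),
)


def grade(output: str, rubric=RUBRIC_V1) -> Dict[str, bool]:
    """Apply every criterion and report pass/fail per criterion."""
    return {criterion.name: criterion.check(output) for criterion in rubric}
```

Reporting one result per criterion, instead of a single pass/fail, makes it easier to see which part of the rubric a candidate fails.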
Curate the eval set
Collect 50 to 500 inputs from real user traffic, or make up inputs that cover the parts of the task that matter. Add tricky inputs and edge cases on purpose, not as an afterthought.
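One way to keep tricky inputs from becoming an afterthought is to label each case with a category and check coverage mechanically. The sketch below assumes category labels such as "typical", "edge", and "adversarial"; those labels, the `EvalCase` type, and `coverage_problems` are illustrative names, and the 50 to 500 bounds are taken from the step above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    category: str                   # assumed labels, e.g. "typical", "edge", "adversarial"
    expected: Optional[str] = None  # known-correct output, when one exists


def coverage_problems(cases: Sequence[EvalCase],
                      required_categories: Sequence[str],
                      min_size: int = 50,
                      max_size: int = 500) -> List[str]:
    """Return human-readable problems; an empty list means the set looks usable."""
    problems = []
    if not min_size <= len(cases) <= max_size:
        problems.append(f"size {len(cases)} outside {min_size}..{max_size}")

    # Every category that matters must have at least one case.
    per_category = Counter(case.category for case in cases)
    for category in required_categories:
        if per_category[category] == 0:
            problems.append(f"no cases in category '{category}'")

    # Duplicate ids make later score comparisons ambiguous.
    id_counts = Counter(case.case_id for case in cases)
    duplicates = sorted(cid for cid, n in id_counts.items() if n > 1)
    if duplicates:
        problems.append(f"duplicate case ids: {duplicates}")
    return problems
```

A check like this could run in review before the set is frozen, so that a missing category is caught before any score depends on it.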
Freeze and version
Commit the eval set and rubric with a version tag. Treat any change to either like a code change. That means reviewed, dated, and explained.
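Alongside the version tag, a content hash can make silent edits detectable. This is a sketch under stated assumptions: each case is a JSON-serializable dict with a unique `id` key, the rubric is stored as text, and the manifest layout (`version`, `reason`, `sha256`) is hypothetical.

```python
import hashlib
import json
from typing import Dict, List


def fingerprint(cases: List[Dict], rubric_text: str) -> str:
    """Hash the eval set and rubric together, independent of case order."""
    ordered = sorted(cases, key=lambda case: case["id"])
    payload = json.dumps({"cases": ordered, "rubric": rubric_text},
                         sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def freeze(version: str, cases: List[Dict], rubric_text: str, reason: str) -> Dict[str, str]:
    """Build a manifest to commit next to the eval set (assumed layout)."""
    return {
        "version": version,
        "reason": reason,  # the explanation a reviewer would expect
        "sha256": fingerprint(cases, rubric_text),
    }


def is_unchanged(manifest: Dict[str, str], cases: List[Dict], rubric_text: str) -> bool:
    """True when the current files still match what was frozen."""
    return manifest["sha256"] == fingerprint(cases, rubric_text)
```

A harness could call `is_unchanged` at the start of every run and refuse to report a score if the eval set or rubric no longer matches the committed manifest.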
Measure baseline
Run the eval set against the cheapest setup that could plausibly work. That could be zero-shot on a small model, or a basic search-plus-prompt. Record the score. This is the floor every later change has to beat.
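A baseline run can be sketched as a loop over the frozen cases that returns a pass rate. In this sketch `baseline_system` is a stand-in for the cheapest candidate (a real one would call a small model), the three cases are made up for illustration, and exact match is assumed as the checker.

```python
from typing import Callable, Dict, List


def exact_match(output: str, expected: str) -> bool:
    return output == expected


def run_eval(system: Callable[[str], str],
             cases: List[Dict[str, str]],
             checker: Callable[[str, str], bool] = exact_match) -> float:
    """Return the fraction of cases whose output passes the checker."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for case in cases
                 if checker(system(case["input"]), case["expected"]))
    return passed / len(cases)


def baseline_system(text: str) -> str:
    """Stand-in for the cheapest candidate; only normalizes the input."""
    return text.strip().lower()


# Made-up cases for illustration only.
CASES = [
    {"input": " Yes ", "expected": "yes"},
    {"input": "NO", "expected": "no"},
    {"input": "maybe?", "expected": "maybe"},  # the stand-in keeps the "?" and fails
]

baseline_score = run_eval(baseline_system, CASES)  # 2 of 3 cases pass
```

Because `run_eval` takes the system as a plain callable, any later candidate can be scored by the same function on the same cases.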
Iterate with eval as the only ground truth
Run the test for every prompt edit, model swap, search change, or new tool. Ship the changes that improve the score. Revert the ones that drop it. Resist the urge to bless a change because a side demo 'feels better'.
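The ship-or-revert rule can be written down as a small gate. This is a sketch, not a prescribed policy: the `Score` type and `decide` function are hypothetical, the "hold" outcome for a tie and the optional `min_gain` margin are assumptions added here, and the version check reflects the idea that scores are only comparable on the same frozen eval set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    eval_version: str  # version tag of the frozen eval set used for the run
    value: float       # e.g. pass rate between 0.0 and 1.0


def decide(baseline: Score, candidate: Score, min_gain: float = 0.0) -> str:
    """Return "ship", "revert", or "hold" for a candidate change."""
    if baseline.eval_version != candidate.eval_version:
        raise ValueError("scores come from different eval versions")
    if candidate.value < baseline.value:
        return "revert"  # the change dropped the score
    if candidate.value > baseline.value + min_gain:
        return "ship"    # the change improved the score by more than the margin
    return "hold"        # assumption: ties and tiny gains need a human decision
```

Note that nothing in the gate takes a "feels better" input; the only way to change the outcome is to change the score.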
Refresh the eval on shifts
When production traffic changes or a new failure shows up, add cases in a new versioned block. Never quietly overwrite the old baseline. Keep the change visible so you can still compare against history.
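One possible way to keep history comparable is to store the eval set as append-only versioned blocks and report a score per block. In the sketch below, `EvalBlock`, `add_block`, and `scores_by_version` are hypothetical names, cases are reduced to plain strings, and `passes` stands in for whatever checker the harness uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class EvalBlock:
    version: str
    cases: Tuple[str, ...]  # cases simplified to strings for illustration


def add_block(blocks: Tuple[EvalBlock, ...],
              version: str,
              new_cases: Iterable[str]) -> Tuple[EvalBlock, ...]:
    """Return a new tuple with one more block; existing blocks are never edited."""
    new_cases = tuple(new_cases)
    if not new_cases:
        raise ValueError("a new block needs at least one case")
    if any(block.version == version for block in blocks):
        raise ValueError(f"version {version!r} already exists; add a new one instead")
    return blocks + (EvalBlock(version, new_cases),)


def scores_by_version(blocks: Tuple[EvalBlock, ...],
                      passes: Callable[[str], bool]) -> Dict[str, float]:
    """Pass rate per block, so old baselines stay comparable."""
    return {
        block.version: sum(1 for case in block.cases if passes(case)) / len(block.cases)
        for block in blocks
    }
```

Scoring per block shows whether a change that helps the new failure cases also holds its score on the original set.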
Principles
- The test is the spec. If it does not measure something, that something does not ship.
- Freeze before you build. A rubric written after the first prompt is shaped by that prompt.
- Cover the parts that matter, not just the happy path. Tricky inputs and edge cases count as real inputs.
- Every change runs the test, every time. No exceptions for 'small' tweaks.
Known failure modes (2)
Related patterns (3)
- ★★ LLM-as-Judge
Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
- ★ Frozen Rubric Reflection
Constrain reflection to a fixed, hand-authored rubric of criteria so the reviewer cannot invent new ones each run.
- ★ Sampled Prompt Trace Eval
Capture full prompt/response/metadata traces from production into a monitoring dataset, but only run LLM-judge evaluation on a random sample so monitoring cost stays bounded as traffic grows.
Related compositions (2)
- recipe · abstract shape · Eval & Observability
How you keep an agent honest in production: harness, judge, decision log, provenance, shadow rollouts.
- recipe · abstract shape · Production LLM Platform
Stand up a production LLM/RAG system whose data pipeline, model pipeline, and inference path scale and deploy independently.
Sources (2)
AI Engineering
Ch 3 'Evaluation Methodology'; Ch 4 'Evaluate AI Systems' “Challenges in Evaluating Foundation Models ... AI as Judge ... Comparative Evaluation for Model Ranking”
aie-book chapter-summaries.md (Chip Huyen)
Ch 3 'Evaluation Methodology'; Ch 4 'Evaluate AI Systems' “Unlike exact evaluation, subjective metrics are highly dependent on the judge. Their scores need to be interpreted in the context of what judges are being used ... AI judges, like all AI applications, should be iterated upon ... It's impos…”
Provenance
- Added to catalog:
- Last updated:
- Verification status: verified