Evaluation-Driven Development
Judge every prompt change, model swap, search tweak, and new tool against a test you committed to up front, not by feel.
Description
Build the test harness before you build the large language model (LLM) app. Write down a grading rubric and a set of test cases first, then save both with a version tag. From then on, those scores decide which model you pick, how you change the prompt, and whether any change ships. If you write the rubric after the prompt, the rubric just rewards what the prompt already does well. Freezing the test first stops that.
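One way to make "save both with a version tag" concrete is to derive the tag from the content itself, so any later edit to the rubric or cases is detectable. The sketch below is illustrative only: the rubric entries, the test cases, and the helper names `freeze` and `is_unchanged` are hypothetical and do not come from the pattern itself.

```python
import hashlib
import json

# Hypothetical rubric: each criterion is a checkable yes/no statement.
RUBRIC = [
    {"id": "cites_source", "check": "Answer names the policy document it relies on."},
    {"id": "no_unsupported_claims", "check": "Every claim is backed by retrieved text."},
]

# Hypothetical test cases, written before any prompt exists.
CASES = [
    {"id": "case-001", "input": "How many days of parental leave do I get?"},
    {"id": "case-002", "input": "Can I expense a home office chair?"},
]


def freeze(rubric, cases):
    """Bundle rubric and cases with a version tag derived from their content."""
    payload = json.dumps({"rubric": rubric, "cases": cases}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"version": digest[:12], "rubric": rubric, "cases": cases}


def is_unchanged(frozen):
    """Return True if the rubric and cases still match the recorded version tag."""
    recomputed = freeze(frozen["rubric"], frozen["cases"])["version"]
    return recomputed == frozen["version"]


FROZEN = freeze(RUBRIC, CASES)
```

Calling `is_unchanged(FROZEN)` before each scoring run would catch a rubric that was quietly adjusted to suit the current prompt. A git tag or a dataset registry would serve the same purpose; the content hash is just the smallest self-contained option.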
When to apply
Use this when you start any LLM app beyond a one-off demo. Examples: search over policy docs, an agent that fills CRM fields, a chatbot for one narrow workflow, or a grader for coding tasks. Reach for it before you write the first prompt. Don't apply it when you are still exploring an open research question with no product to ship, because you cannot know the rubric yet.
What it involves
- Articulate 'good' as a checkable rubric
- Curate the eval set
- Freeze and version
- Measure baseline
- Iterate with the eval as the only ground truth
- Refresh the eval when inputs or requirements shift
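The middle steps above (measure a baseline, then let the frozen eval decide what ships) can be sketched as a small gate. Everything here is an assumption made for illustration: the eval cases, the substring check standing in for a real rubric, the `score` and `should_ship` helpers, and the two stand-in functions that take the place of actual LLM calls.

```python
from typing import Callable

# Hypothetical frozen eval set; a substring match stands in for a real rubric check.
EVAL_SET = {
    "version": "v1",
    "cases": [
        {"input": "refund window?", "must_contain": "30 days"},
        {"input": "support email?", "must_contain": "support@example.com"},
        {"input": "shipping cost?", "must_contain": "free"},
    ],
}


def score(app: Callable[[str], str], eval_set: dict) -> float:
    """Return the fraction of cases whose output passes the check."""
    cases = eval_set["cases"]
    passed = sum(1 for case in cases if case["must_contain"] in app(case["input"]))
    return passed / len(cases)


def should_ship(candidate: Callable[[str], str], baseline_score: float, eval_set: dict) -> bool:
    """Allow a change only if it does not score below the recorded baseline."""
    return score(candidate, eval_set) >= baseline_score


def baseline_app(question: str) -> str:
    """Stand-in for the first working version of the app (no real LLM call)."""
    if "refund" in question:
        return "Refunds are accepted within 30 days."
    return "I am not sure."


def candidate_app(question: str) -> str:
    """Stand-in for the app after a prompt change (no real LLM call)."""
    answers = {
        "refund window?": "Refunds are accepted within 30 days.",
        "support email?": "Write to support@example.com.",
    }
    return answers.get(question, "I am not sure.")


# Measure the baseline once, against the frozen set, before iterating.
BASELINE_SCORE = score(baseline_app, EVAL_SET)
```

With these stand-ins, `should_ship(candidate_app, BASELINE_SCORE, EVAL_SET)` passes because the candidate answers more cases than the baseline does. A real gate would likely add a margin for run-to-run noise and record the eval version next to each score.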