Methodology · Evaluation

Evaluation-Driven Development

Judge every prompt change, model swap, search tweak, and new tool against a test you committed to up front, not by feel.

Description

Build the test harness before you build the large language model (LLM) app. Write down a grading rubric and a set of test cases first, then save both with a version tag. From then on, those scores decide which model you pick, how you change the prompt, and whether any change ships. If you write the rubric after the prompt, the rubric just rewards what the prompt already does well. Freezing the test first stops that.
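One way to make "save both with a version tag" concrete is to derive the tag from the content itself, so any later edit to the rubric or the cases yields a different tag. The following is a minimal sketch under that assumption; the names `Criterion`, `EvalCase`, and `freeze` are hypothetical and do not come from any particular library.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Criterion:
    """One checkable line of the rubric (hypothetical structure)."""
    name: str
    description: str


@dataclass(frozen=True)
class EvalCase:
    """One test case: an input and what a good answer must contain."""
    case_id: str
    prompt: str
    expected: str


def freeze(rubric: List[Criterion], cases: List[EvalCase]) -> Dict[str, Any]:
    """Serialize rubric and cases deterministically and tag them with a content hash."""
    payload = {
        "rubric": [asdict(c) for c in rubric],
        "cases": [asdict(c) for c in cases],
    }
    # sort_keys makes the serialization stable, so equal content gives an equal tag.
    blob = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    tag = hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]
    return {"version": tag, **payload}


rubric = [Criterion("grounded", "Answer quotes or cites a policy passage")]
cases = [EvalCase("c1", "What is the refund window?", "30 days")]
frozen = freeze(rubric, cases)
print(frozen["version"])  # a 12-character hex tag derived from the content
```

Committing the returned dictionary (for example as a JSON file) before writing the first prompt gives every later score a fixed reference point.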

When to apply

Use this when you start any LLM app beyond a one-off demo. Examples: search over policy docs, an agent that fills CRM fields, a chatbot for one narrow workflow, or a grader for coding tasks. Reach for it before you write the first prompt. Don't apply it when you are still exploring an open research question with no product to ship, because you cannot know the rubric yet.

What it involves

Articulate 'good' as a checkable rubric
Curate the eval set
Freeze and version
Measure baseline
Iterate with eval as the only ground truth
Refresh the eval when inputs or requirements shift
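The baseline and iteration steps above can be sketched as a small scoring loop. This is an illustrative sketch only: the substring check stands in for whatever the real rubric requires, and `score`, `should_ship`, and the two toy apps are hypothetical names invented for the example.

```python
from typing import Callable, Dict, List

Case = Dict[str, str]


def score(app: Callable[[str], str], cases: List[Case]) -> float:
    """Fraction of cases whose output contains the expected text (a stand-in grader)."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(
        1 for case in cases
        if case["expected"].lower() in app(case["prompt"]).lower()
    )
    return passed / len(cases)


def should_ship(baseline: float, candidate: float, min_gain: float = 0.0) -> bool:
    """A change ships only if it scores at least baseline + min_gain on the frozen set."""
    return candidate >= baseline + min_gain


# Frozen eval set (assumed to be loaded from the versioned file in practice).
cases = [
    {"prompt": "What is the refund window?", "expected": "30 days"},
    {"prompt": "How much is shipping?", "expected": "free"},
]


def baseline_app(prompt: str) -> str:
    # Toy stand-in for the current prompt and model.
    return "Refunds are accepted within 30 days."


def candidate_app(prompt: str) -> str:
    # Toy stand-in for the changed prompt or swapped model.
    return "Refunds are accepted within 30 days, and shipping is free."


baseline_score = score(baseline_app, cases)
candidate_score = score(candidate_app, cases)
print(should_ship(baseline_score, candidate_score))
```

The point of the design is that the decision to ship reads only the two scores, never an impression of the outputs.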
