Methodology · Evaluation · proven · verified

Evaluation-Driven Development

Also known as: eval-first development, EDD

Applies to: agent, rag-system, llm-app, coding-agent

Tags: eval-first, rubric, regression-gate

Build the test harness before you build the large language model (LLM) app. Write down a grading rubric and a set of test cases first, then save both with a version tag. From then on, scores on that frozen test decide which model you pick, how you change the prompt, and whether any change ships. If you write the rubric after the prompt, the rubric ends up rewarding what the prompt already does well. Freezing the test first prevents that.

Intent. Judge every prompt change, model swap, search tweak, and new tool against a test you committed to up front, not by feel.

When to apply. Use this when you start any LLM app beyond a one-off demo. Examples: search over policy docs, an agent that fills CRM fields, a chatbot for one narrow workflow, or a grader for coding tasks. Reach for it before you write the first prompt. Don't apply it when you are still exploring an open research question with no product to ship, because you cannot know the rubric yet.

Inputs

  • Task definition: A clear statement of what the system must do. Include the subject area and who the user is.
  • Real or synthetic input distribution: Sample user inputs that cover the parts of the task that matter: different formats, edge cases, and tricky inputs meant to break things.
  • Notion of 'good': An expert in the field who can say, even roughly, what a correct or acceptable output looks like.

Outputs

  • Versioned eval set: A frozen set of inputs, and sometimes the matching correct outputs, checked into the repo with a version tag.
  • Rubric or checker: An automatic way to grade outputs. It can be a fixed checker, a model-judge prompt against a frozen rubric, an exact match to expected output, or a mix.
  • Metrics dashboard: A repeatable run that produces a score for any candidate prompt, model, or setup.
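
One way to make "frozen" checkable is to store a content hash next to the version tag, so a silent edit to any case is detectable. The sketch below is only an illustration under assumptions: the `EvalCase` and `EvalSet` names, their fields, and the sample cases are hypothetical and do not come from this catalog entry.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EvalCase:
    """One frozen input, with a known-correct output when one exists (hypothetical shape)."""
    case_id: str
    input_text: str
    expected: Optional[str] = None


@dataclass(frozen=True)
class EvalSet:
    """A versioned, immutable collection of eval cases (hypothetical shape)."""
    version: str
    cases: Tuple[EvalCase, ...]

    def fingerprint(self) -> str:
        # Hash a canonical JSON form so any edit to a case or the version changes the digest.
        payload = json.dumps(
            {"version": self.version, "cases": [asdict(c) for c in self.cases]},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Illustrative cases only; a real set would come from user traffic or curated inputs.
eval_v1 = EvalSet(
    version="v1",
    cases=(
        EvalCase("refund-001", "How long do refunds take?", "5 business days"),
        EvalCase("refund-002", "refund??? NOW"),  # tricky input, no single expected string
    ),
)

print(eval_v1.version, eval_v1.fingerprint()[:12])
```

Recording the fingerprint alongside each score makes it possible to confirm later that two runs used the same eval set.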

Steps (6)

  1. Articulate 'good' as a checkable rubric

    Before any prompt exists, write down what a correct answer looks like. Use a fixed checker for structured tasks. Use a model-judge against a frozen rubric for open-ended tasks. Compare against expected output when you have known-correct answers.

  2. Curate the eval set

    Collect 50 to 500 inputs from real user traffic, or make up inputs that cover the parts of the task that matter. Add tricky inputs and edge cases on purpose, not as an afterthought.

  3. Freeze and version

    Commit the eval set and rubric with a version tag. Treat any change to either like a code change. That means reviewed, dated, and explained.

  4. Measure baseline

    Run the cheapest setup that might work against the test. That could be zero-shot on a small model, or a basic search-plus-prompt. Record the score. This is the floor every later change has to beat.

  5. Iterate with eval as the only ground truth

    Run the test for every prompt edit, model swap, search change, or new tool. Ship the changes that improve the score. Revert the ones that drop it. Resist the urge to bless a change because a side demo 'feels better'.

  6. Refresh the eval on shifts

    When production traffic changes or a new failure shows up, add cases in a new versioned block. Never quietly overwrite the old baseline. Keep the change visible so you can still compare against history.
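
Steps 1, 4, and 5 can be sketched as a small harness: a fixed checker, a scoring run, and a decision against the recorded baseline. Everything named here is an assumption for illustration. The eval cases are toy data, `baseline_system` and `candidate_system` are offline stand-ins for real model calls, and the "hold" outcome for a tied score is a choice this sketch makes, since the steps above only specify what to do on improvement or regression.

```python
from typing import Callable, List, Tuple

# (input, expected) pairs. A real set would be loaded from the frozen, versioned file.
EVAL_SET_V1: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("10 / 4", "2.5"),
    ("  7 * 6 ", "42"),  # edge case: stray whitespace around the input
]


def exact_match(output: str, expected: str) -> bool:
    """Fixed checker: pass only if the trimmed output equals the expected answer."""
    return output.strip() == expected


def score(system: Callable[[str], str], eval_set: List[Tuple[str, str]]) -> float:
    """Fraction of eval cases the system passes."""
    passed = sum(exact_match(system(inp), exp) for inp, exp in eval_set)
    return passed / len(eval_set)


def decide(candidate_score: float, baseline_score: float) -> str:
    """Ship on improvement, revert on regression, hold on a tie (the tie rule is an assumption)."""
    if candidate_score > baseline_score:
        return "ship"
    if candidate_score < baseline_score:
        return "revert"
    return "hold"


def baseline_system(prompt: str) -> str:
    """Stand-in for the cheapest setup: always answers "4"."""
    return "4"


def candidate_system(prompt: str) -> str:
    """Stand-in for a changed prompt or model: knows two of the three answers."""
    answers = {"2 + 2": "4", "10 / 4": "2.5"}
    return answers.get(prompt.strip(), "unknown")


baseline = score(baseline_system, EVAL_SET_V1)    # the floor from step 4
candidate = score(candidate_system, EVAL_SET_V1)  # rerun for the change in step 5
print(decide(candidate, baseline))  # prints "ship"
```

For open-ended tasks, `exact_match` would be replaced by a model-judge call against the frozen rubric; the scoring and decision logic would stay the same.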

Principles

  • The test is the spec. If it does not measure something, that something does not ship.
  • Freeze before you build. A rubric written after the first prompt is shaped by that prompt.
  • Cover the parts that matter, not just the happy path. Tricky inputs and edge cases count as real inputs.
  • Every change runs the test, every time. No exceptions for 'small' tweaks.
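
To keep edge cases from being hidden behind a good aggregate number, scores can be reported per slice as well as overall. This is a minimal sketch under assumptions: the slice names (`happy_path`, `edge_case`, `adversarial`) and the sample results are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def score_by_slice(results: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Pass rate per slice, given (slice_name, passed) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for slice_name, passed in results:
        totals[slice_name] += 1
        passes[slice_name] += int(passed)
    return {name: passes[name] / totals[name] for name in totals}


# Hypothetical graded results from one eval run.
results = [
    ("happy_path", True), ("happy_path", True),
    ("happy_path", True), ("happy_path", True),
    ("edge_case", True), ("edge_case", False),
    ("adversarial", False), ("adversarial", False),
]

overall = sum(passed for _, passed in results) / len(results)
by_slice = score_by_slice(results)

print(f"overall: {overall:.3f}")
for name, rate in by_slice.items():
    print(f"{name}: {rate:.2f}")
```

In this made-up run the overall pass rate is 0.625, while the adversarial slice passes nothing, which the aggregate alone would not show.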

Provenance

  • Added to catalog:
  • Last updated:
  • Verification status: verified