AI-as-Judge Evaluation

Get a repeatable numeric score for open-ended outputs by handing the grading to a calibrated model instead of people.

Description

Use a strong model as a judge that grades the outputs of the system under test against a fixed rubric. This is faster and cheaper than human grading at scale, and it works for open-ended tasks where an exact-match checker cannot help. How well it works depends on three things: a clear rubric, a judge whose scores have been calibrated against human ratings, and awareness of the judge's own biases.
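As a rough illustration of grading against a fixed rubric, the sketch below builds a rubric prompt and parses a numeric score from the judge's reply. Everything named here is an assumption made for the example: the rubric wording, the 1 to 5 scale, the `SCORE:` reply format, and the `judge` callable, which stands in for whatever model client you actually use.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric for a summarization task; the wording and the
# 1-5 scale are illustrative, not taken from any particular source.
RUBRIC = """You are grading a summary against its source text.
Score from 1 to 5:
5 = accurate, complete, no unsupported claims
3 = mostly accurate, minor omissions
1 = inaccurate or misleading
Reply with a single line: SCORE: <integer>"""


def build_prompt(source: str, output: str) -> str:
    """Combine the fixed rubric with one item to grade."""
    return f"{RUBRIC}\n\nSOURCE:\n{source}\n\nSUMMARY:\n{output}\n"


def parse_score(reply: str, lo: int = 1, hi: int = 5) -> Optional[int]:
    """Extract the score; return None if it is missing or out of range."""
    match = re.search(r"SCORE:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if lo <= score <= hi else None


def grade(judge: Callable[[str], str], source: str, output: str) -> Optional[int]:
    """Grade one output. `judge` is any function mapping prompt -> reply text."""
    return parse_score(judge(build_prompt(source, output)))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the code."""
    return "SCORE: 4"
```

Returning `None` for unparseable replies, rather than guessing a score, keeps judge failures visible in the aggregate instead of silently skewing it.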

When to apply

Use this for open-ended tasks such as summaries, explanations, chat replies, and code explanations, where an exact-match checker does not fit and human grading is too slow to run on every change. Do not apply it when the judge model is weaker than the system you are testing, because the scores are then capped by the judge's own blind spots. Skip it too when the rubric cannot be written tightly enough for a model to follow.

What it involves

  • Author the rubric prompt
  • Calibrate against human ratings
  • Run on the eval set
  • Aggregate and version
  • Periodically re-calibrate
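The calibration and aggregation steps above might be sketched as follows. This is a minimal example under stated assumptions: scores are integers on a shared scale, judge and human scores are aligned item by item, and the function names, the choice of agreement metrics, and the `rubric_version` and `judge_model` labels are all hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence


def _check_aligned(judge: Sequence[int], human: Sequence[int]) -> None:
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty score lists of equal length")


def exact_agreement(judge: Sequence[int], human: Sequence[int]) -> float:
    """Fraction of items where the judge and the human gave the same score."""
    _check_aligned(judge, human)
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def mean_signed_diff(judge: Sequence[int], human: Sequence[int]) -> float:
    """Average of (judge - human); positive means the judge scores higher."""
    _check_aligned(judge, human)
    return mean(j - h for j, h in zip(judge, human))


def aggregate(scores: Sequence[Optional[int]], rubric_version: str,
              judge_model: str) -> dict:
    """Summarize one eval run, tagged with the rubric and judge it used."""
    valid = [s for s in scores if s is not None]
    return {
        "rubric_version": rubric_version,
        "judge_model": judge_model,
        "n_scored": len(valid),
        "n_unparsed": len(scores) - len(valid),  # replies with no usable score
        "mean_score": mean(valid) if valid else None,
    }


# Illustrative numbers only, not real evaluation data.
judge_scores = [4, 3, 5, 2]
human_scores = [4, 2, 5, 2]
agreement = exact_agreement(judge_scores, human_scores)
bias = mean_signed_diff(judge_scores, human_scores)
report = aggregate([4, None, 2], rubric_version="v1", judge_model="judge-x")
```

Tagging each report with the rubric version and judge model means scores from different configurations are not compared by accident, and re-running the two calibration functions on a fresh human-rated sample is one way to carry out the periodic re-calibration step.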
