AI-as-Judge Evaluation
Get a repeatable numeric score for open-ended outputs by handing the grading to a model you have validated against human ratings, instead of to people.
Description
Use a strong model as a judge. It grades the outputs of the system you are testing against a fixed rubric. This is faster and cheaper than having people grade at scale. It also works for open-ended tasks where an exact-match checker cannot help. How well it works depends on three things: a clear rubric, a judge whose scores you have checked against human ratings, and an understanding of the judge's own biases.
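A minimal sketch of rubric-based grading, under stated assumptions: the rubric wording, the 1 to 5 scale, the `SCORE: <n>` reply format, and the names `RUBRIC`, `build_prompt`, `parse_score`, and `judge` are all hypothetical. The judge model is passed in as a plain callable (`call_model`), so no particular provider API is implied.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric: the wording and the 1-5 scale are illustrative only.
RUBRIC = (
    "Score the summary from 1 to 5.\n"
    "5: accurate, complete, concise.\n"
    "3: mostly accurate but misses a key point.\n"
    "1: inaccurate or off-topic.\n"
    "End your reply with a line of the form 'SCORE: <n>'."
)

def build_prompt(source: str, output: str) -> str:
    """Combine the fixed rubric with the item being graded."""
    return f"{RUBRIC}\n\nSource:\n{source}\n\nSummary to grade:\n{output}\n"

def parse_score(reply: str, lo: int = 1, hi: int = 5) -> Optional[int]:
    """Extract the score; return None if it is missing or out of range."""
    match = re.search(r"SCORE:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if lo <= score <= hi else None

def judge(source: str, output: str,
          call_model: Callable[[str], str]) -> Optional[int]:
    """Grade one output. call_model is any function mapping prompt -> reply."""
    return parse_score(call_model(build_prompt(source, output)))
```

Returning `None` for an unparseable or out-of-range reply, rather than guessing a value, keeps malformed judge output from leaking into the score.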
When to apply
Use this for open-ended tasks such as summaries, explanations, chat replies, and code explanations. These are cases where an exact-match checker does not fit and human grading is too slow to run on every change. Don't apply it when the judge model is weaker than the system you are testing, because the scores then top out at the limit of what the judge can recognize and inherit its blind spots. Skip it too when the rubric cannot be written tightly enough for a model to follow.
What it involves
- Author the rubric prompt
- Calibrate against human ratings
- Run on the eval set
- Aggregate and version
- Periodically re-calibrate
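The calibration and aggregation steps above can be sketched as follows. This is an illustration, not a prescribed method: the agreement metric (share of items where judge and human scores differ by at most `tolerance`), the report fields, and the names `agreement` and `run_report` are assumptions. Hashing the rubric text is one possible way to version it, so that scores produced under different rubrics are never compared by accident.

```python
import hashlib
from statistics import mean
from typing import Optional, Sequence

def agreement(judge_scores: Sequence[int], human_scores: Sequence[int],
              tolerance: int = 0) -> float:
    """Share of items where judge and human differ by at most `tolerance`."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(abs(j - h) <= tolerance
               for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)

def run_report(scores: Sequence[Optional[int]], rubric_text: str,
               judge_name: str) -> dict:
    """Aggregate one eval run and tag it with the rubric and judge used."""
    valid = [s for s in scores if s is not None]
    return {
        "judge": judge_name,
        # A short hash of the rubric text acts as a rubric version id.
        "rubric_sha": hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()[:12],
        "n_scored": len(valid),
        "n_unparsed": len(scores) - len(valid),
        "mean_score": mean(valid) if valid else None,
    }

# Calibration on a small human-rated sample (made-up numbers).
human = [5, 3, 4, 2]
judged = [5, 4, 4, 2]
exact = agreement(judged, human)                  # 3 of 4 items match
within_one = agreement(judged, human, tolerance=1)

# Aggregation over an eval run; None marks a reply that could not be parsed.
report = run_report([5, 4, None, 2], "rubric text v1", "hypothetical-judge")
```

Re-running `agreement` on a fresh human-rated sample whenever the rubric or the judge model changes is one way to carry out the periodic re-calibration step.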