Methodology · Evaluation · proven · verified

AI-as-Judge Evaluation

also known as LLM-as-judge, model-graded eval

Applies to: agent, llm-app, rag-system, coding-agent

Tags: judge-model, calibration, open-ended-grading

Use a strong model as a judge. It grades the outputs of the system you are testing against a fixed rubric. This is faster and cheaper than having people grade at scale. It also works for open-ended tasks where an exact-match checker cannot help. How well it works depends on three things: a clear rubric, a judge you have checked against people, and knowing the judge's own biases.

Methodology process overview

flowchart TD
    eval[Eval set] --> s3[Run on the eval set]
    judge[Judge model] --> s1[Author rubric prompt]
    s1 --> rubric[(Frozen rubric prompt)]
    rubric --> s2[Calibrate vs humans]
    humans[Domain experts] --> s2
    s2 --> kappa{Kappa >= 0.6?}
    kappa -->|no| s1
    kappa -->|yes| s3
    s3 --> scores[Per-example scores]
    scores --> s4[Aggregate and version]
    s4 --> agg[Aggregate metrics pinned to judge+rubric version]
    agg --> s5[Periodic re-calibration]
    s5 --> drift{Judge drift?}
    drift -->|yes| s1
    drift -->|no| agg

Intent. Get a repeatable numeric score for open-ended outputs by delegating the grading to a calibrated model instead of people.

When to apply. Use this for open-ended tasks such as summaries, explanations, chat replies, and code explanations, where an exact-match checker does not fit and human grading is too slow to run on every change. Do not apply it when the judge model is weaker than the system you are testing, because the scores then plateau at the judge's blind spots. Skip it too when the rubric cannot be written tightly enough for a model to follow.

Example scenario

A team building a customer-support summariser needed to grade 4,000 daily summaries for faithfulness, tone, and length, far too many for people to review by hand. They picked Claude Opus as the judge. The system under test ran on Sonnet, so the judge was clearly the stronger model on support conversations. The rubric prompt named three dimensions to score, each on a 1-5 scale with clear anchors: faithfulness 5 meant 'every factual claim is grounded in the source ticket', and faithfulness 1 meant 'made-up entities or numbers'.

Before trusting the judge, they had two experts grade 80 sampled summaries without seeing the judge's output. The first pass produced a Cohen's kappa of 0.48, which was too low. Most of the disagreement was about tone, where the rubric anchors were too vague. They rewrote the tone descriptions with concrete examples, for instance '5 = warm without being syrupy, e.g. acknowledges frustration once and moves to action', re-ran the check, and reached 0.71. The rubric was pinned at v3. From then on, every nightly run scored every summary on the v3 rubric, using Opus-2026-04 as the judge.

Three months in, Anthropic released a newer judge model. The team treated the swap as a fresh baseline: they re-calibrated against new human ratings, reached a kappa of 0.74, and only then trusted scores under the new judge version. The quarterly spot-check also caught one quiet regression: a prompt change had nudged the system toward 'cheerful' tone-fives that the judge happily rewarded but humans found grating.

Inputs

Eval set — The inputs to be graded. Ideally each comes with a correct output or a set of rules to check against.
Rubric prompt — Fixed instructions for the judge. They name what to score and the rating scale to use.
Judge model — A model at least as strong as the system you are testing, in the subject area that matters.
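As a rough illustration of how a frozen rubric prompt and a structured judge reply might fit together, here is a minimal Python sketch. The `Rubric` class, the `parse_scores` helper, and the JSON reply format are all hypothetical names and choices made for this example; the methodology only asks that the rubric be fixed and that the judge return a structured score.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a rubric version should never be edited in place
class Rubric:
    version: str
    dimensions: tuple[str, ...]
    scale_min: int
    scale_max: int
    instructions: str

    def render(self, source: str, output: str) -> str:
        """Build the judge prompt for one example (the layout is an assumption)."""
        keys = ", ".join(f'"{d}"' for d in self.dimensions)
        return (
            f"{self.instructions}\n\n"
            f"Score each of {keys} on a {self.scale_min}-{self.scale_max} scale.\n"
            "Reply with a JSON object containing exactly those keys.\n\n"
            f"SOURCE:\n{source}\n\nOUTPUT TO GRADE:\n{output}"
        )


def parse_scores(rubric: Rubric, judge_reply: str) -> dict[str, int]:
    """Validate the judge's JSON reply; raise ValueError on anything off-rubric."""
    data = json.loads(judge_reply)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    scores: dict[str, int] = {}
    for dim in rubric.dimensions:
        if dim not in data:
            raise ValueError(f"missing dimension: {dim}")
        value = data[dim]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{dim} must be an integer")
        if not rubric.scale_min <= value <= rubric.scale_max:
            raise ValueError(f"{dim} out of range: {value}")
        scores[dim] = value
    return scores
```

Rejecting out-of-range or missing scores, rather than silently clamping them, keeps a misbehaving judge visible in the logs instead of leaking into the aggregates.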

Outputs

Per-example scores — A numeric score for each input, broken out by rubric dimension.
Aggregate metrics — The roll-up across the whole test set: average score, how much it varies, and pass rate.
Judge calibration log — A record of regular human spot-checks that confirm the judge still agrees with the experts.
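The roll-up described above can be sketched in a few lines of Python. This is only an illustration: the function names, the report layout, and the `pass_threshold` default of 4 are assumptions made for the example, since the methodology names a pass rate but not where the pass line sits.

```python
from statistics import mean, pstdev


def aggregate(per_example, judge_version, rubric_version, pass_threshold=4):
    """Roll per-example scores up into per-dimension metrics.

    per_example: list of dicts mapping dimension name -> integer score.
    pass_threshold: hypothetical cut-off; a score >= threshold counts as a pass.
    """
    if not per_example:
        raise ValueError("no scores to aggregate")
    metrics = {}
    for dim in sorted(per_example[0]):
        values = [example[dim] for example in per_example]
        metrics[dim] = {
            "mean": mean(values),
            "stdev": pstdev(values),  # population standard deviation
            "pass_rate": sum(v >= pass_threshold for v in values) / len(values),
        }
    # Pin the report to the exact judge and rubric that produced it.
    return {
        "judge_version": judge_version,
        "rubric_version": rubric_version,
        "n": len(per_example),
        "metrics": metrics,
    }


def comparable(report_a, report_b):
    """Two reports share a baseline only if judge and rubric versions both match."""
    return (
        report_a["judge_version"] == report_b["judge_version"]
        and report_a["rubric_version"] == report_b["rubric_version"]
    )
```

Storing the versions inside the report itself makes it hard to compare numbers across a judge or rubric change by accident.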

Steps (5)

1. Author the rubric prompt
Spell out what to grade, such as faithfulness, relevance, and brevity. Set the rating scale, such as 1 to 5, pass/fail, or A-vs-B. Say what each level means. Make the judge return a structured score, not free prose.
2. Calibrate against human ratings
Have experts grade a sample of 50 to 100 outputs using the same rubric. Compare their scores to the judge's. Tweak the rubric until the judge and the people agree closely enough. A common bar is a Cohen's kappa agreement score of 0.6 or higher.
3. Run on the eval set
Score every output. Use the same judge model and the same rubric version for every run. That keeps scores comparable across rounds.
4. Aggregate and version
Roll up the scores and tie them to the judge-model version and the rubric version. Changing either one means you have to set a fresh baseline.
5. Periodically re-calibrate
Collect fresh human ratings every few months, or whenever the judge model is upgraded. Models drift over time. Checking the judge against people is ongoing upkeep, not a one-time setup.
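The calibration and re-calibration steps both reduce to the same check: compute Cohen's kappa between human and judge labels and compare it to the bar. The sketch below implements the standard unweighted kappa, (observed agreement - chance agreement) / (1 - chance agreement), in plain Python. The function names are hypothetical, and the 0.6 default simply mirrors the common bar mentioned above. Unweighted kappa treats every disagreement equally, so on an ordinal 1 to 5 scale a weighted variant may be the better fit; that choice is left open here.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of each rater's label rate.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def calibration_passes(human_scores, judge_scores, threshold=0.6):
    """Gate for steps 2 and 5: True when judge-human agreement clears the bar."""
    return cohens_kappa(human_scores, judge_scores) >= threshold
```

Running the same gate on fresh human ratings after a judge upgrade gives a simple drift signal: if it fails, the loop goes back to the rubric.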

Principles

The judge must be at least as strong as the system it grades. Otherwise scores plateau at the judge's blind spots.
Asking the judge to pick A over B is more reliable than asking for an absolute score.
Freeze the judge prompt. Changing it makes your old scores no longer comparable.
Human spot-checks are not optional. They are how you catch the judge drifting.
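To make the A-over-B principle concrete, here is a small Python sketch of pairwise grading. Everything in it is illustrative: `judge` stands for any callable that is shown two outputs and returns "A" (the first shown) or "B" (the second shown), and the helper names are hypothetical. Asking twice with the order swapped is an extra precaution added for this example, not something the principles above prescribe; it guards against a judge that favours whichever output it sees first.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: (shown_first, shown_second) -> "A" or "B".
Judge = Callable[[str, str], str]


def pairwise_verdict(judge: Judge, first: str, second: str) -> str:
    """Return 'first', 'second', or 'tie' after judging in both orders."""
    forward = judge(first, second)    # "A" here means `first` won
    backward = judge(second, first)   # "B" here means `first` won
    if forward == "A" and backward == "B":
        return "first"
    if forward == "B" and backward == "A":
        return "second"
    # The two orderings disagree, so the preference is not stable: count a tie.
    return "tie"


def win_rate(judge: Judge, candidates: Sequence[str], baselines: Sequence[str]) -> dict:
    """Compare candidate outputs with baseline outputs, example by example."""
    if len(candidates) != len(baselines) or not candidates:
        raise ValueError("need two non-empty sequences of equal length")
    verdicts = [pairwise_verdict(judge, c, b) for c, b in zip(candidates, baselines)]
    wins, losses = verdicts.count("first"), verdicts.count("second")
    return {
        "wins": wins,
        "losses": losses,
        "ties": len(verdicts) - wins - losses,
        "win_rate": wins / len(verdicts),
    }
```

In use, `judge` would wrap a call to the frozen rubric prompt; a stub such as `lambda a, b: "A"` is enough to see that a purely position-biased judge produces only ties.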

Related compositions (1)

recipe · abstract shape
Eval & Observability
How you keep an agent honest in production: harness, judge, decision log, provenance, shadow rollouts.

Related methodologies (1)

Evaluation-Driven Development ★★
6 steps
Judge every prompt change, model swap, search tweak, and new tool against a test you committed to up front, not by feel.

Provenance

Added to catalog: 2026-05-24
Last updated: 2026-05-27
Verification status: verified
