Methodology · Evaluation · proven · verified

AI-as-Judge Evaluation

also known as LLM-as-judge, model-graded eval

Applies to: agent, llm-app, rag-system, coding-agent

Tags: judge-model, calibration, open-ended-grading

Use a strong model as a judge. It grades the outputs of the system you are testing against a fixed rubric. This is faster and cheaper than having people grade at scale. It also works for open-ended tasks where an exact-match checker cannot help. How well it works depends on three things: a clear rubric, a judge you have checked against people, and knowing the judge's own biases.

Intent. Get a repeatable numeric score for open-ended outputs by handing the grading to a model that has been calibrated against human raters, instead of grading everything by hand.

When to apply. Use this for open-ended tasks such as summaries, explanations, chat replies, and code explanations. These are cases where an exact-match checker does not fit and human grading is too slow to run on every change. Don't apply it when the judge model is weaker than the system you are testing, because then the scores are capped by the judge's blind spots: it cannot reward or penalize what it cannot recognize. Skip it too when the rubric cannot be written tightly enough for a model to follow.

Inputs

  • Eval set: The inputs to be graded. Ideally each comes with a correct output or a set of rules to check against.
  • Rubric prompt: Fixed instructions for the judge. They name what to score and the rating scale to use.
  • Judge model: A model at least as strong as the system you are testing, in the subject area that matters.
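As an illustration of the rubric-prompt input, here is a minimal sketch of a fixed prompt template and a strict parser for the judge's structured reply. The criterion descriptions, the JSON reply format, and the names RUBRIC_VERSION, build_prompt, and parse_scores are assumptions made for this example, not part of the methodology itself.

```python
import json

# Hypothetical version label; bump it whenever the prompt text changes.
RUBRIC_VERSION = "rubric-v1"

CRITERIA = ("faithfulness", "relevance", "brevity")

# Illustrative rubric wording; a real rubric should define every scale level.
RUBRIC_PROMPT = """You are grading a response against a fixed rubric.
Score each criterion from 1 (poor) to 5 (excellent):
- faithfulness: every claim is supported by the source text
- relevance: the response addresses the question asked
- brevity: no padding or repetition

Source:
{source}

Response:
{response}

Return only a JSON object with integer values, for example:
{{"faithfulness": 3, "relevance": 4, "brevity": 2}}
"""


def build_prompt(source: str, response: str) -> str:
    """Fill the fixed template with one example from the eval set."""
    return RUBRIC_PROMPT.format(source=source, response=response)


def parse_scores(raw: str) -> dict[str, int]:
    """Parse the judge's reply; raise ValueError on anything off-rubric."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {raw!r}") from exc
    if not isinstance(data, dict):
        raise ValueError(f"judge reply is not a JSON object: {raw!r}")
    scores = {}
    for name in CRITERIA:
        value = data.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"missing or non-integer score for {name!r}")
        if not 1 <= value <= 5:
            raise ValueError(f"score for {name!r} is off the 1 to 5 scale")
        scores[name] = value
    return scores
```

Rejecting malformed replies outright, rather than guessing a score from free prose, keeps bad judge output from silently entering the aggregates.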

Outputs

  • Per-example scores: A numeric score for each input, broken out by each thing you are scoring.
  • Aggregate metrics: The roll-up across the whole test set: average score, how much it varies, and pass rate.
  • Judge calibration log: A record of regular human spot-checks that confirm the judge still agrees with the experts.
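The aggregate metrics can be sketched with the standard library alone. In this example the RunSummary fields, the pass threshold of 4, and the use of population standard deviation are assumptions chosen for illustration. Keeping the judge-model and rubric versions next to the numbers makes it easy to see when two runs can be compared.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class RunSummary:
    """Roll-up for one eval run (hypothetical record layout)."""
    judge_model: str
    rubric_version: str
    n: int
    mean_score: float
    stdev: float
    pass_rate: float


def summarize(scores: list[int], judge_model: str, rubric_version: str,
              pass_threshold: int = 4) -> RunSummary:
    """Aggregate per-example scores for a single criterion."""
    if not scores:
        raise ValueError("no scores to aggregate")
    passed = sum(1 for s in scores if s >= pass_threshold)
    return RunSummary(
        judge_model=judge_model,
        rubric_version=rubric_version,
        n=len(scores),
        mean_score=float(mean(scores)),
        stdev=pstdev(scores),  # population std dev over this run's scores
        pass_rate=passed / len(scores),
    )


def comparable(a: RunSummary, b: RunSummary) -> bool:
    """Two runs share a baseline only if judge and rubric both match."""
    return (a.judge_model, a.rubric_version) == (b.judge_model, b.rubric_version)
```

For example, summarize([5, 4, 3, 4], "judge-2024-01", "rubric-v1") gives a mean of 4.0 and a pass rate of 0.75; the model and rubric labels here are placeholders.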

Steps (5)

  1. Author the rubric prompt

    Spell out what to grade, such as faithfulness, relevance, and brevity. Set the rating scale, such as 1 to 5, pass/fail, or A-vs-B. Say what each level means. Make the judge return a structured score, not free prose.

  2. Calibrate against human ratings

    Have experts grade a sample of 50 to 100 outputs using the same rubric. Compare their scores to the judge's. Tweak the rubric until the judge and the people agree closely enough. A common bar is a Cohen's kappa of 0.6 or higher; kappa measures agreement after correcting for the agreement expected by chance.

  3. Run on the eval set

    Score every output. Use the same judge model and the same rubric version for every run. That keeps scores comparable across rounds.

  4. Aggregate and version

    Roll up the scores and tie them to the judge-model version and the rubric version. Changing either one means you have to set a fresh baseline.

  5. Periodically re-calibrate

    Collect fresh human ratings every few months, or whenever the judge model is upgraded. Models drift over time. Checking the judge against people is ongoing upkeep, not a one-time setup.
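The calibration check in step 2, repeated in step 5, comes down to computing Cohen's kappa between human and judge labels on the same sample. The sketch below is a plain unweighted kappa written with the standard library; the function name and the sample labels are made up for illustration. For ordinal scales such as 1 to 5, a weighted kappa may be a better fit, which this sketch does not cover.

```python
from collections import Counter


def cohens_kappa(human: list, judge: list) -> float:
    """Unweighted Cohen's kappa between two raters over the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    # Observed agreement: fraction of items where both raters match.
    observed = sum(1 for h, j in zip(human, judge) if h == j) / n
    # Chance agreement: from each rater's own label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(human_counts[label] * judge_counts[label]
                   for label in human_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined here, so report full agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up pass/fail labels (1 = pass, 0 = fail) for ten outputs.
human_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
judge_labels = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
kappa = cohens_kappa(human_labels, judge_labels)
```

In this made-up sample the raters agree on 8 of 10 items, yet kappa comes out near 0.58, just under the 0.6 bar. That gap is why raw percent agreement alone can overstate how well the judge tracks the experts.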

Principles

  • The judge must be at least as strong as the system it grades. Otherwise scores are capped by the judge's blind spots.
  • Asking the judge to pick A over B is more reliable than asking for an absolute score.
  • Treat the judge prompt as frozen. Changing it means your old scores are no longer comparable.
  • Human spot-checks are not optional. They are how you catch the judge drifting.
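The A-versus-B principle can be sketched as a small wrapper around a judge call. One bias often reported for model judges is a preference for whichever answer is shown first, so this sketch asks twice with the order swapped and only declares a winner when both orderings agree. The Judge callable, its "first"/"second" return values, and the compare function are hypothetical names for this example; a real judge would be a model call behind the same interface.

```python
from typing import Callable

# Hypothetical interface: given two outputs in display order,
# the judge returns "first" or "second" for the one it prefers.
Judge = Callable[[str, str], str]


def compare(judge: Judge, output_a: str, output_b: str) -> str:
    """Return "A", "B", or "tie" after checking both display orders."""
    forward = judge(output_a, output_b)   # A shown first
    backward = judge(output_b, output_a)  # B shown first
    if forward == "first" and backward == "second":
        return "A"  # A preferred in both orderings
    if forward == "second" and backward == "first":
        return "B"  # B preferred in both orderings
    # The verdict flipped with the order, so it reflects position,
    # not quality. Treat it as a tie rather than trusting either call.
    return "tie"


def always_first(first: str, second: str) -> str:
    """Stub judge with pure position bias, for demonstration only."""
    return "first"


result = compare(always_first, "answer one", "answer two")
```

With the fully position-biased stub above, result is "tie", which is the desired outcome: the wrapper refuses to turn an order effect into a preference.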

Provenance

  • Added to catalog:
  • Last updated:
  • Verification status: verified