LLM-as-Judge

also known as Model Grading, Auto-Evaluator

Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.

This pattern helps complete certain larger patterns —

used-by: Eval Harness ★★ - Run a held-out dataset against agent versions to detect regressions and measure improvement.
used-by: Evaluator-Optimizer ★★ - One LLM generates; another evaluates and feeds back; loop until criteria are met.
used-by: Shadow Canary ★★ - Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.
used-by: Scorer Live Monitoring ★ - Score agent outputs asynchronously in production with non-blocking scorers that observe, alert, and log but do not regenerate the output.
used-by: [evaluation-driven-development]
used-by: Sampled Prompt Trace Eval ★ - Capture full prompt/response/metadata traces from production into a monitoring dataset, but only run LLM-judge evaluation on a random sample so monitoring cost stays bounded as traffic grows.
used-by: Prompt Variant Evaluation ★★ - Author multiple variants of the same prompt node, run them as a batch against a shared dataset, and let an automated evaluation flow score them so the winning variant is selected by measurement.

Context

A team is evaluating an agent whose outputs are free-form text — summaries, generated code, long-form prose, support replies — where no single reference answer is uniquely correct. They want regression detection automated enough to run on every release or pull request, not paced by how many summaries a human can grade in a week. They are willing to write down what good looks like in the form of a rubric.

Problem

Exact-match scoring fails on free-form outputs because many different answers are acceptable, and text-similarity metrics miss the qualities the team actually cares about, such as faithfulness, completeness, or tone. Pure human grading is too slow to gate a CI pipeline that runs many times per day. The team is forced to choose between cheap-but-blind metrics that miss real regressions and expensive human review that does not scale.

Forces

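Judge models have systematic biases: they can favour longer outputs, outputs in a particular position, and outputs from their own model family.
Calibrating the judge against human judgement requires its own human-graded dataset, which must be built and kept current.
Judging is suspect when the judge and the candidate come from the same model family.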

Example

A team running a summarisation eval relies on humans to grade 200 summaries per release, which takes a week and gates every deploy. They add LLM-as-Judge: a model from a different family scores each summary against a rubric (faithfulness, completeness, clarity) and emits a structured score plus rationale. Each week they calibrate the judge against a 30-sample human-graded slice and flag drift. Releases now ship daily behind an automated quality gate, and humans only spot-check.
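
The weekly calibration step in this example can be sketched as follows. This is a minimal illustration, not the team's actual method: the function names, the 1-5 score scale, the one-point tolerance, and the 0.10 drift threshold are all assumptions chosen for the sketch.

```python
from typing import Sequence


def agreement_rate(judge_scores: Sequence[int],
                   human_scores: Sequence[int],
                   tolerance: int = 1) -> float:
    """Fraction of samples where the judge is within `tolerance` points
    of the human grade (tolerance of 1 is an assumed default)."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be non-empty and the same length")
    hits = sum(abs(j - h) <= tolerance
               for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)


def drift_flagged(current: float, baseline: float,
                  max_drop: float = 0.10) -> bool:
    """Flag drift when agreement falls more than `max_drop` below the
    baseline agreement (0.10 is an assumed threshold)."""
    return (baseline - current) > max_drop


# Hypothetical four-sample slice; the example in the text uses 30 samples.
judge = [5, 4, 2, 3]
human = [5, 3, 4, 3]
rate = agreement_rate(judge, human)          # 3 of 4 within one point
needs_review = drift_flagged(rate, baseline=0.90)
```

Agreement within a tolerance is only one possible calibration statistic; a rank correlation or a per-criterion breakdown may suit a given rubric better.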

Diagram

sequenceDiagram
    participant C as Candidate model
    participant J as Judge model (different family)
    participant H as Human calibration set
    C->>J: input + candidate output + rubric
    J-->>C: structured score + rationale
    H->>J: periodically calibrate
    J-->>H: calibration drift report

Solution

Therefore:

Define a rubric. Prompt a judge model with the input, candidate output, and rubric. Receive a structured score plus rationale. Calibrate periodically against human-graded samples. Use a different model family for judge vs candidate where possible.
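
The steps above can be sketched in a few functions. This is an illustrative sketch under stated assumptions: the rubric wording, the 1-5 scale, the JSON reply shape, and the `call_judge_model` callable (a stand-in for whatever client reaches the judge model) are all hypothetical and not part of the pattern itself.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict

# Assumed rubric; the criteria names come from the example, the wording does not.
RUBRIC: Dict[str, str] = {
    "faithfulness": "Every claim in the output is supported by the input.",
    "completeness": "The output covers the key points of the input.",
    "clarity": "The output is easy to read and unambiguous.",
}


@dataclass
class Verdict:
    scores: Dict[str, int]  # criterion name -> score from 1 to 5
    rationale: str


def build_judge_prompt(source: str, candidate: str,
                       rubric: Dict[str, str]) -> str:
    """Assemble the input, candidate output, and rubric into one prompt."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
    return (
        "You are grading a candidate output against a rubric.\n"
        f"Rubric (score each criterion from 1 to 5):\n{criteria}\n\n"
        f"Input:\n{source}\n\n"
        f"Candidate output:\n{candidate}\n\n"
        'Reply with JSON only: {"scores": {"<criterion>": <int>}, '
        '"rationale": "<text>"}'
    )


def parse_verdict(raw: str, rubric: Dict[str, str]) -> Verdict:
    """Validate the judge's reply instead of trusting it blindly."""
    data = json.loads(raw)
    scores = data["scores"]
    for name in rubric:
        value = scores.get(name)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 1 <= value <= 5:
            raise ValueError(f"missing or invalid score for {name!r}")
    return Verdict(scores={name: scores[name] for name in rubric},
                   rationale=str(data.get("rationale", "")))


def judge(source: str, candidate: str,
          call_judge_model: Callable[[str], str],
          rubric: Dict[str, str] = RUBRIC) -> Verdict:
    """Prompt the judge model and return a structured verdict."""
    prompt = build_judge_prompt(source, candidate, rubric)
    return parse_verdict(call_judge_model(prompt), rubric)


# Stub standing in for a real judge model, so the sketch runs offline.
def fake_judge_model(prompt: str) -> str:
    return json.dumps({
        "scores": {"faithfulness": 5, "completeness": 4, "clarity": 4},
        "rationale": "Accurate, but omits one detail.",
    })


verdict = judge("source text", "candidate summary", fake_judge_model)
```

Passing the model call in as a function keeps the judge independent of the candidate's model client, which makes it straightforward to point the judge at a different model family.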

What this pattern forbids. Treating judge scores as authoritative without calibration: scores are advisory unless the judge has been calibrated against human judgement at known intervals.
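
One way to enforce this rule in a release gate is to let a failing score block only while the calibration is fresh. The sketch below is an assumption-laden illustration: the function name, the three outcome labels, and the seven-day maximum calibration age are hypothetical choices, not part of the pattern.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def gate_decision(score: float,
                  threshold: float,
                  last_calibrated: Optional[datetime],
                  now: datetime,
                  max_age: timedelta = timedelta(days=7)) -> str:
    """Return "pass", "block", or "advisory-fail".

    A failing score only blocks when the judge was calibrated against
    human grades within `max_age` (seven days is an assumed interval).
    Otherwise the failure is reported but does not gate the release.
    """
    calibrated = (last_calibrated is not None
                  and (now - last_calibrated) <= max_age)
    if score >= threshold:
        return "pass"
    return "block" if calibrated else "advisory-fail"


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
fresh = datetime(2024, 1, 5, tzinfo=timezone.utc)   # 5 days old
stale = datetime(2023, 12, 1, tzinfo=timezone.utc)  # 40 days old

fresh_result = gate_decision(2.5, 3.5, fresh, now)
stale_result = gate_decision(2.5, 3.5, stale, now)
```

With stale or missing calibration, the failure still surfaces in reports, so a human can decide whether to act on it.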

The smaller patterns that complete this one —

generalises: Agent-as-a-Judge ★ - Evaluate an agent's full trajectory (steps, tool calls, intermediate states) by another agent rather than scoring only the final output.
generalises: Blind Grader with Isolated Context ★ - Run an evaluator in a separately-allocated context window with access only to the artifact and the rubric, never the producing agent's reasoning trace, so the grader cannot be primed by the producer's framing.

And the patterns that stand alongside it, or against it —

alternative-to: Reward Hacking ✕ - Anti-pattern: optimise the agent against a single proxy metric and assume the metric remains a faithful proxy after optimisation pressure.
alternative-to: Sycophancy ✕ - Anti-pattern: train or tune an agent on user-preference feedback without a counter-balancing truth signal.
complements: Cross-Reflection ★ - Reflection step performed by a *different* agent or foundation model from the original generator, so critique error is decorrelated from generation error.
complements: Generator-Critic Separation ★ - Strict role separation between a Generator agent that produces drafts and a Critic agent that judges them against pre-defined criteria; the Critic never generates.
complements: Heterogeneous-Model Council with Synthesis Judge ★ - Three or more role-specialized personas run on different model architectures in parallel; a synthesis judge, given only their structured JSON and not the original input, produces the final verdict.
complements: Intermediate Artifact Evaluation ★ - Evaluate intermediate artifacts (plans, tool-call traces, guardrail reactions) not only final outputs; isolates failure to a specific pipeline node.
complements: Agent Evaluator ★ - A dedicated agent or harness whose sole job is running tests against another agent's outputs to evaluate performance; distinct from eval-harness (offline batch) and llm-as-judge (per-output).
complements: Dimensional Synthetic Eval Set ★ - Generate evaluation inputs not by free-form LLM prompting (which mode-collapses) but by enumerating tuples over explicitly named dimensions and seeding generation from each tuple.
alternative-to: Trajectory Anomaly Monitor · - Run a trained, non-LLM verifier out-of-band over the agent's action trajectory at runtime to flag task-misaligned plans and malformed step sequences at millisecond latency, before the actions cause damage.
