LLM-as-Judge
also known as Model Grading, Auto-Evaluator
Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
This pattern helps complete certain larger patterns —
- used-by Eval Harness ★★: Run a held-out dataset against agent versions to detect regressions and measure improvement.
- used-by Evaluator-Optimizer ★★: One LLM generates; another evaluates and feeds back; loop until criteria are met.
- used-by Shadow Canary ★★: Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.
- used-by Scorer Live Monitoring ★: Score agent outputs asynchronously in production with non-blocking scorers that observe, alert, and log but do not regenerate the output.
- used-by [evaluation-driven-development]
- used-by Sampled Prompt Trace Eval ★: Capture full prompt/response/metadata traces from production into a monitoring dataset, but only run LLM-judge evaluation on a random sample so monitoring cost stays bounded as traffic grows.
- used-by Prompt Variant Evaluation ★★: Author multiple variants of the same prompt node, run them as a batch against a shared dataset, and let an automated evaluation flow score them so the winning variant is selected by measurement.
Context
A team is evaluating an agent whose outputs are free-form text — summaries, generated code, long-form prose, support replies — where no single reference answer is uniquely correct. They want regression detection automated enough to run on every release or pull request, not paced by how many summaries a human can grade in a week. They are willing to write down what good looks like in the form of a rubric.
Problem
Exact-match scoring fails on free-form outputs because there are many acceptable answers, and similarity metrics on raw text miss the qualities the team actually cares about such as faithfulness, completeness, or tone. Pure human grading is too slow to gate a CI pipeline that runs many times per day. The team is forced to choose between cheap-but-blind metrics that miss real regressions and expensive human review that does not scale.
Forces
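- Judge models carry systematic biases: toward longer outputs (length), toward whichever candidate occupies a particular slot in a pairwise comparison (position), and toward outputs from their own model family.
- Calibrating against human judgement requires a human-graded dataset of its own, which has to be built and kept current.
- A judge from the same model family as the candidate is suspect, because it may share the candidate's blind spots and prefer its style.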
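One of these biases, length, can be probed cheaply. The sketch below is an illustration only: it correlates output word count with judge score over a batch. The function names `pearson` and `length_bias` are hypothetical, and a high correlation is a hint worth investigating rather than proof of bias, since longer outputs can be genuinely better.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either side is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation, so no measurable relationship
    return cov / sqrt(var_x * var_y)


def length_bias(outputs: Sequence[str], scores: Sequence[float]) -> float:
    """Correlation between candidate word count and the judge's score."""
    word_counts = [len(text.split()) for text in outputs]
    return pearson(word_counts, scores)
```

If the correlation stays high on a slice where human graders show no such preference, the rubric or judge prompt probably needs tightening.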
Example
A team running a summarisation eval relies on humans to grade 200 summaries per release, which takes a week and gates every deploy. They add llm-as-judge: a different model family scores each summary against a rubric (faithfulness, completeness, clarity) and emits a structured score plus rationale. They calibrate weekly against a 30-sample human-graded slice and flag drift. Releases now ship daily with an automated quality gate, and humans only spot-check.
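The weekly calibration step in this example could look roughly like the following sketch. It compares judge scores with human scores on the same slice and flags drift when agreement falls below a floor. The function name `calibration_report`, the tolerance of one rubric point, and the 0.8 agreement floor are all assumptions chosen for illustration, not values from the pattern.

```python
from statistics import mean
from typing import Dict, Sequence


def calibration_report(
    human: Sequence[int],
    judge: Sequence[int],
    tolerance: int = 1,          # assumed: scores within 1 point count as agreeing
    min_agreement: float = 0.8,  # assumed: below this, flag drift
) -> Dict[str, object]:
    """Compare judge scores with human scores on the same samples."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty, equal-length score lists")
    diffs = [abs(h - j) for h, j in zip(human, judge)]
    agreement = mean(1.0 if d <= tolerance else 0.0 for d in diffs)
    return {
        "mean_abs_error": mean(diffs),
        "agreement": agreement,
        "drift": agreement < min_agreement,
    }
```

When `drift` is true, the team would typically pause the automated gate and review the rubric, the judge prompt, or the judge model before trusting scores again.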
Solution
Therefore:
Define a rubric. Prompt a judge model with the input, candidate output, and rubric. Receive a structured score plus rationale. Calibrate periodically against human-graded samples. Use a different model family for judge vs candidate where possible.
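The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the rubric wording, the 1 to 5 scale, the JSON reply shape, and the names `RUBRIC`, `Verdict`, `build_prompt`, and `judge` are all assumptions. The judge model is passed in as a plain callable (`call_model`, hypothetical) so that no particular vendor API is implied.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict

# Assumed rubric, using the criteria from the summarisation example.
RUBRIC: Dict[str, str] = {
    "faithfulness": "Every claim in the output is supported by the input.",
    "completeness": "The output covers the key points of the input.",
    "clarity": "The output is easy to read and unambiguous.",
}


@dataclass
class Verdict:
    scores: Dict[str, int]
    rationale: str


def build_prompt(task_input: str, candidate: str, rubric: Dict[str, str]) -> str:
    """Assemble the judge prompt from input, candidate output, and rubric."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
    return (
        "You are grading a candidate output against a rubric.\n"
        f"Rubric (score each criterion from 1 to 5):\n{criteria}\n\n"
        f"Input:\n{task_input}\n\n"
        f"Candidate output:\n{candidate}\n\n"
        'Reply with JSON only: {"scores": {"<criterion>": <int>}, '
        '"rationale": "<text>"}'
    )


def judge(
    task_input: str,
    candidate: str,
    rubric: Dict[str, str],
    call_model: Callable[[str], str],  # hypothetical: prompt in, raw reply out
) -> Verdict:
    """Ask the judge model for a structured score and validate the reply."""
    raw = call_model(build_prompt(task_input, candidate, rubric))
    data = json.loads(raw)
    scores = data["scores"]
    for name in rubric:
        value = scores.get(name)
        valid = isinstance(value, int) and not isinstance(value, bool)
        if not valid or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {name!r}: {value!r}")
    return Verdict(
        scores={name: scores[name] for name in rubric},
        rationale=str(data["rationale"]),
    )
```

Validating the reply before using it matters because a judge that returns malformed or out-of-range scores should fail loudly rather than silently skew the aggregate.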
What this pattern forbids. Treating uncalibrated judge scores as authoritative: scores are advisory unless they have been calibrated against human judgement at known intervals.
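One way to encode this rule in a release gate is sketched below. The function name `gate_decision`, the three outcome strings, and the seven-day maximum calibration age are assumptions for illustration; the pattern itself only requires that the interval be known.

```python
from datetime import datetime, timedelta
from typing import Optional


def gate_decision(
    score: float,
    threshold: float,
    last_calibrated: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(days=7),  # assumed calibration interval
) -> str:
    """Return 'pass' or 'block' only when calibration is fresh, else 'advisory'."""
    calibrated = last_calibrated is not None and now - last_calibrated <= max_age
    if not calibrated:
        # Stale or missing calibration: report the score but never block on it.
        return "advisory"
    return "pass" if score >= threshold else "block"
```

Passing `now` explicitly keeps the function deterministic and easy to test.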
The smaller patterns that complete this one —
- generalises Agent-as-a-Judge ★: Evaluate an agent's full trajectory (steps, tool calls, intermediate states) by another agent rather than scoring only the final output.
- generalises Blind Grader with Isolated Context ★: Run an evaluator in a separately-allocated context window with access only to the artifact and the rubric, never the producing agent's reasoning trace, so the grader cannot be primed by the producer's framing.
And the patterns that stand alongside it, or against it —
- alternative-to Reward Hacking ✕: Anti-pattern: optimise the agent against a single proxy metric and assume the metric remains a faithful proxy after optimisation pressure.
- alternative-to Sycophancy ✕: Anti-pattern: train or tune an agent on user-preference feedback without a counter-balancing truth signal.
- complements Cross-Reflection ★: Reflection step performed by a *different* agent or foundation model from the original generator, so critique error is decorrelated from generation error.
- complements Generator-Critic Separation ★: Strict role separation between a Generator agent that produces drafts and a Critic agent that judges them against pre-defined criteria; the Critic never generates.
- complements Heterogeneous-Model Council with Synthesis Judge ★: Three or more role-specialized personas run on different model architectures in parallel; a synthesis judge, given only their structured JSON and not the original input, produces the final verdict.
- complements Intermediate Artifact Evaluation ★: Evaluate intermediate artifacts (plans, tool-call traces, guardrail reactions), not only final outputs; this isolates failure to a specific pipeline node.
- complements Agent Evaluator ★: A dedicated agent or harness whose sole job is running tests against another agent's outputs to evaluate performance; distinct from eval-harness (offline batch) and llm-as-judge (per-output).
- complements Dimensional Synthetic Eval Set ★: Generate evaluation inputs not by free-form LLM prompting (which mode-collapses) but by enumerating tuples over explicitly named dimensions and seeding generation from each tuple.