LLM-as-Judge
Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
Problem
Exact-match scoring fails on free-form outputs because there are many acceptable answers, and similarity metrics on raw text miss the qualities the team actually cares about, such as faithfulness, completeness, or tone. Pure human grading is too slow to gate a CI pipeline that runs many times per day. The team is forced to choose between cheap-but-blind metrics that miss real regressions and expensive human review that does not scale.
Solution
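Define a rubric, then prompt a judge model with the input, the candidate output, and the rubric. Have the judge return a structured score plus a rationale, so that results can be parsed by the pipeline and audited by a person. Calibrate the judge periodically against human-graded samples. Where possible, use a different model family for the judge than for the candidate, which reduces the risk of the judge favoring outputs that resemble its own.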
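The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the rubric criteria, the 1 to 5 scale, the JSON response shape, and the names `build_prompt`, `parse_verdict`, and `judge` are all assumptions made for the example. The judge model is passed in as a plain callable (prompt in, text out) so that no particular provider API is implied.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical rubric: criterion name -> what the judge should look for.
RUBRIC: Dict[str, str] = {
    "faithfulness": "Claims are supported by the input; nothing is invented.",
    "completeness": "All parts of the request are addressed.",
    "tone": "The register suits the intended audience.",
}


@dataclass
class Verdict:
    scores: Dict[str, int]  # one integer score per rubric criterion
    rationale: str          # the judge's free-text justification


def build_prompt(task_input: str, candidate: str, rubric: Dict[str, str],
                 scale: Tuple[int, int] = (1, 5)) -> str:
    """Assemble the judge prompt from input, candidate output, and rubric."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
    return (
        f"Score the candidate output on each criterion from {scale[0]} to {scale[1]}.\n"
        f"Criteria:\n{criteria}\n\n"
        f"Input:\n{task_input}\n\n"
        f"Candidate output:\n{candidate}\n\n"
        'Reply with JSON only: {"scores": {"<criterion>": <int>}, "rationale": "<text>"}'
    )


def parse_verdict(raw: str, rubric: Dict[str, str],
                  scale: Tuple[int, int] = (1, 5)) -> Verdict:
    """Validate the judge's reply; raise ValueError on anything malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or not isinstance(data.get("scores"), dict):
        raise ValueError("reply must be a JSON object with a 'scores' object")
    scores = data["scores"]
    if set(scores) != set(rubric):
        raise ValueError("scores must cover exactly the rubric criteria")
    for name, value in scores.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"score for {name!r} is not an integer")
        if not scale[0] <= value <= scale[1]:
            raise ValueError(f"score for {name!r} is outside the scale")
    return Verdict(scores=scores, rationale=str(data.get("rationale", "")))


def judge(call_model: Callable[[str], str], task_input: str, candidate: str,
          rubric: Dict[str, str]) -> Verdict:
    """Run one judgment: build the prompt, call the model, parse the reply."""
    return parse_verdict(call_model(build_prompt(task_input, candidate, rubric)), rubric)
```

Strict validation is a deliberate choice in this sketch: a malformed judge reply raises an error instead of silently becoming a score, so a CI gate can treat it as a judge failure and not as a candidate regression.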
When to use
- Open-ended outputs need automated regression detection without a reference answer.
- A rubric can be written that covers the qualities you actually care about.
- Periodic calibration against human-graded samples is feasible.
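One way to make the calibration condition concrete is to compare judge scores with human scores on the same samples. The sketch below is illustrative only: the choice of metrics (exact agreement, agreement within a tolerance, and Cohen's kappa) and the function names are assumptions, and the card does not prescribe any particular agreement measure or threshold.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(human: Sequence[int], judge: Sequence[int],
                   tolerance: int = 0) -> float:
    """Fraction of samples where judge and human differ by at most `tolerance`."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(abs(h - j) <= tolerance for h, j in zip(human, judge))
    return hits / len(human)


def cohen_kappa(human: Sequence[int], judge: Sequence[int]) -> float:
    """Exact agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, calling `agreement_rate(human_scores, judge_scores, tolerance=1)` on a freshly graded batch gives a quick signal of drift, while `cohen_kappa` is stricter because it discounts agreement that would occur by chance when most samples receive the same score.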
Related
- Eval Harness
- Evaluator-Optimizer
- Agent-as-a-Judge
- Shadow Canary
- Blind Grader with Isolated Context
- Scorer Live Monitoring
- Reward Hacking
- Sycophancy
- Cross-Reflection
- Generator-Critic Separation
- Heterogeneous-Model Council with Synthesis Judge
- Intermediate Artifact Evaluation
- Agent Evaluator
- Sampled Prompt Trace Eval
- Dimensional Synthetic Eval Set
- Prompt Variant Evaluation