Governance & Observability

LLM-as-Judge

Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.

Problem

Exact-match scoring fails on free-form outputs because there are many acceptable answers, and similarity metrics on raw text miss the qualities the team actually cares about, such as faithfulness, completeness, or tone. Pure human grading is too slow to gate a CI pipeline that runs many times per day. The team is forced to choose between cheap-but-blind metrics that miss real regressions and expensive human review that does not scale.

Solution

Define a rubric. Prompt a judge model with the input, the candidate output, and the rubric, and have it return a structured score plus a rationale. Calibrate periodically against human-graded samples. Where possible, use a different model family for the judge than for the candidate, since a judge may favor outputs that resemble its own.
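The judging step can be sketched as follows. This is a minimal illustration, not a reference implementation: the rubric criteria, the 1-5 scale, the JSON reply shape, and the `call_model` callable (a stand-in for whatever client sends a prompt to the judge model and returns its text) are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical rubric: criterion name -> what a high score means.
RUBRIC = {
    "faithfulness": "Every claim in the output is supported by the input.",
    "completeness": "The output covers all points the input asks for.",
}


@dataclass
class Verdict:
    scores: dict[str, int]
    rationale: str


def build_judge_prompt(task_input: str, candidate: str, rubric: dict[str, str]) -> str:
    """Assemble input, candidate output, and rubric into one judge prompt."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
    return (
        "You are grading a candidate output against a rubric.\n"
        f"Rubric (score each criterion from 1 to 5):\n{criteria}\n\n"
        f"Input:\n{task_input}\n\n"
        f"Candidate output:\n{candidate}\n\n"
        'Reply with JSON only: {"scores": {"<criterion>": <int>}, "rationale": "<text>"}'
    )


def parse_verdict(raw: str, rubric: dict[str, str]) -> Verdict:
    """Validate the judge's reply so malformed output fails loudly."""
    data = json.loads(raw)
    scores = data["scores"]
    for name in rubric:
        value = scores.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool) or not 1 <= value <= 5:
            raise ValueError(f"missing or out-of-range score for {name!r}: {value!r}")
    return Verdict(
        scores={name: scores[name] for name in rubric},
        rationale=str(data.get("rationale", "")),
    )


def judge(
    task_input: str,
    candidate: str,
    call_model: Callable[[str], str],  # assumed: prompt text in, reply text out
    rubric: dict[str, str] = RUBRIC,
) -> Verdict:
    """Run one judging pass: build the prompt, call the model, parse the reply."""
    return parse_verdict(call_model(build_judge_prompt(task_input, candidate, rubric)), rubric)
```

Keeping the model call behind a plain callable makes it straightforward to swap judge models, and to test the prompt and parsing logic with a stub instead of a live model.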

When to use

  • Open-ended outputs need automated regression detection without a reference answer.
  • A rubric can be written that covers the qualities you actually care about.
  • Periodic calibration against human-graded samples is feasible.
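One way to make the calibration check concrete is to compare judge scores with human scores on the same samples using a chance-corrected agreement statistic. The sketch below uses Cohen's kappa as one possible choice; the pattern itself does not prescribe a statistic, and the function name and sample data are illustrative assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Fraction of samples where judge and human gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative data only: pass/fail grades on the same six samples.
human_grades = ["pass", "pass", "fail", "pass", "fail", "fail"]
judge_grades = ["pass", "pass", "fail", "fail", "fail", "fail"]
kappa = cohen_kappa(human_grades, judge_grades)
```

A team might rerun this on a fresh human-graded sample at each calibration interval and revisit the rubric or judge prompt when agreement drops below a threshold it has chosen for itself.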

