Governance & Observability

Agent-as-a-Judge

Have a second agent evaluate a candidate agent's full trajectory (steps, tool calls, intermediate states) rather than scoring only the final output.

Problem

A simple grader that looks only at the final answer cannot tell two agents apart when one solved the task cleanly and the other thrashed through twenty redundant tool calls, wrote outside its workspace, or stumbled into the right answer by luck. Process failures such as wasted spend, unsafe actions, or fragile reasoning are invisible to answer-only scoring. Teams are then forced to choose between cheap-but-shallow grading and expensive manual review of every run.

Solution

A judge agent receives the candidate agent's full trajectory: thoughts, tool calls, observations, intermediate state, and final answer. It evaluates the trajectory against a rubric covering correctness, efficiency, and process quality, and outputs a structured verdict with a rationale.
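
A minimal sketch of the flow described above, not a reference implementation. The names `Step`, `Trajectory`, `Verdict`, `build_prompt`, and `judge` are hypothetical, as are the 1-to-5 scale and the pass threshold. The judge model is passed in as a plain callable (`call_model`) so that no particular LLM API is assumed; it is expected to return a JSON string.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Rubric dimensions taken from the pattern text; the 1-5 scale is an assumption.
RUBRIC = ("correctness", "efficiency", "process_quality")


@dataclass
class Step:
    thought: str
    tool: Optional[str] = None          # None when the step made no tool call
    tool_input: Optional[dict] = None
    observation: Optional[str] = None


@dataclass
class Trajectory:
    task: str
    steps: List[Step]
    final_answer: str


@dataclass
class Verdict:
    scores: Dict[str, int]
    passed: bool
    rationale: str


def build_prompt(traj: Trajectory) -> str:
    """Render the whole trajectory, not just the final answer, for the judge."""
    lines = [f"Task: {traj.task}", "Trajectory:"]
    for i, s in enumerate(traj.steps, 1):
        lines.append(
            f"{i}. thought={s.thought!r} tool={s.tool!r} "
            f"input={json.dumps(s.tool_input)} observation={s.observation!r}"
        )
    lines.append(f"Final answer: {traj.final_answer}")
    lines.append(
        "Score each of " + ", ".join(RUBRIC) + " from 1 to 5 and reply with "
        'JSON of the form {"scores": {...}, "rationale": "..."}.'
    )
    return "\n".join(lines)


def judge(traj: Trajectory,
          call_model: Callable[[str], str],
          pass_threshold: int = 3) -> Verdict:
    """Ask the judge model for scores and turn its JSON reply into a Verdict."""
    data = json.loads(call_model(build_prompt(traj)))
    scores = {name: int(data["scores"][name]) for name in RUBRIC}
    # Assumed policy: every rubric dimension must reach the threshold to pass.
    passed = all(score >= pass_threshold for score in scores.values())
    return Verdict(scores=scores, passed=passed, rationale=str(data["rationale"]))
```

In practice `call_model` would wrap whatever model client the team already uses; a stub that returns canned JSON is enough to test the parsing and pass/fail policy in isolation.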

When to use

  • Agent tasks succeed or fail along their trajectory in ways the final answer cannot reveal.
  • You have access to the full trajectory (thoughts, tool calls, observations) of the candidate agent.
  • Process-quality signals (efficiency, redundant steps, unsafe actions) matter for the eval verdict, not just correctness.
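
Some of the process-quality signals named above can be computed deterministically and handed to the judge alongside the raw trajectory. The sketch below is one possible approach under stated assumptions: tool calls are dicts with `tool` and `args` keys, write operations use a hypothetical `write_file` tool with an absolute POSIX `path` argument, and "redundant" means an exact repeat of an earlier call.

```python
import json
import posixpath


def redundant_calls(tool_calls):
    """Count calls that exactly repeat an earlier (tool, args) pair."""
    seen = set()
    repeats = 0
    for call in tool_calls:
        # Serialise args with sorted keys so key order does not hide a repeat.
        key = (call["tool"], json.dumps(call["args"], sort_keys=True))
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats


def writes_outside_workspace(tool_calls, workspace, write_tools=("write_file",)):
    """Return normalised paths of writes that land outside the workspace."""
    root = posixpath.normpath(workspace).rstrip("/")
    offending = []
    for call in tool_calls:
        if call["tool"] not in write_tools:
            continue
        # normpath collapses ".." segments so "/work/../etc" is not mistaken
        # for a path inside "/work".
        path = posixpath.normpath(call["args"]["path"])
        if path != root and not path.startswith(root + "/"):
            offending.append(path)
    return offending
```

Exact-match repeats are a conservative lower bound; near-duplicate calls or symlink escapes would still need the judge agent's own reading of the trajectory.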
