Agent-as-a-Judge
Have a second agent evaluate an agent's full trajectory (steps, tool calls, intermediate states) rather than scoring only the final output.
Problem
A simple grader that looks only at the final answer cannot tell two agents apart when one solved the task cleanly and the other thrashed through twenty redundant tool calls, made a write outside its workspace, or stumbled into the right answer by luck. Process failures such as wasted spend, unsafe actions, or fragile reasoning are completely invisible to answer-only scoring. The team is forced to choose between cheap-but-shallow grading and expensive manual review of every run.
Solution
A judge agent receives the candidate agent's full trajectory: thoughts, tool calls, observations, intermediate state, and final answer. It evaluates the trajectory against a rubric covering correctness, efficiency, and process quality, and outputs a structured verdict with rationale.
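The flow above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the `Step`, `Trajectory`, and `Verdict` types, the 1-5 score scale, the pass threshold, and the `call_judge` callable (a stand-in for whatever model client the judge agent uses) are all hypothetical names introduced here. For brevity the sketch folds intermediate state into each step's observation.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Optional

# Rubric dimensions named in the pattern description.
RUBRIC = ("correctness", "efficiency", "process_quality")


@dataclass
class Step:
    thought: str
    tool: Optional[str] = None           # None when the step made no tool call
    tool_input: dict = field(default_factory=dict)
    observation: str = ""


@dataclass
class Trajectory:
    task: str
    steps: list
    final_answer: str


@dataclass
class Verdict:
    scores: dict       # rubric dimension -> integer score (assumed 1-5 scale)
    passed: bool
    rationale: str


def render_trajectory(traj: Trajectory) -> str:
    """Flatten the trajectory into text the judge can read."""
    lines = [f"TASK: {traj.task}"]
    for i, step in enumerate(traj.steps, start=1):
        lines.append(f"[{i}] thought: {step.thought}")
        if step.tool is not None:
            args = json.dumps(step.tool_input, sort_keys=True)
            lines.append(f"[{i}] tool call: {step.tool}({args})")
            lines.append(f"[{i}] observation: {step.observation}")
    lines.append(f"FINAL ANSWER: {traj.final_answer}")
    return "\n".join(lines)


def judge(traj: Trajectory, call_judge: Callable[[str], str],
          pass_threshold: int = 3) -> Verdict:
    """Ask the judge for rubric scores and parse a structured verdict."""
    prompt = (
        "Score the agent run below from 1 to 5 on each of: "
        + ", ".join(RUBRIC)
        + '. Reply with JSON only: {"scores": {...}, "rationale": "..."}\n\n'
        + render_trajectory(traj)
    )
    data = json.loads(call_judge(prompt))
    # A missing rubric dimension raises KeyError rather than passing silently.
    scores = {name: int(data["scores"][name]) for name in RUBRIC}
    passed = all(score >= pass_threshold for score in scores.values())
    return Verdict(scores=scores, passed=passed, rationale=str(data["rationale"]))


# Usage with a stub in place of a real judge model (assumption for illustration).
def stub_judge(prompt: str) -> str:
    return json.dumps({
        "scores": {"correctness": 5, "efficiency": 2, "process_quality": 4},
        "rationale": "Correct answer, but the same search was issued twice.",
    })


traj = Trajectory(
    task="Find the release year of Python 3.0.",
    steps=[
        Step("Search for it.", "search", {"q": "python 3.0 release"}, "2008"),
        Step("Search again.", "search", {"q": "python 3.0 release"}, "2008"),
    ],
    final_answer="2008",
)
verdict = judge(traj, stub_judge)
```

With the stub's scores, the run fails despite a correct answer because efficiency falls below the threshold, which is the kind of distinction answer-only scoring cannot make.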
When to use
- Agent tasks succeed or fail along their trajectory in ways the final answer cannot reveal.
- You have access to the full trajectory (thoughts, tool calls, observations) of the candidate agent.
- Process-quality signals (efficiency, redundant steps, unsafe actions) matter for the eval verdict, not just correctness.
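Some of the process-quality signals listed above can be computed deterministically from the trajectory and handed to the judge as evidence. The sketch below is one possible approach, not part of the pattern itself: the tool-call dictionary shape, the `write_file` tool name, its `path` argument, and the `/workspace` root are all assumptions made for illustration.

```python
import json
import posixpath
from collections import Counter


def process_signals(tool_calls, workspace="/workspace"):
    """Count repeated tool calls and flag writes that escape the workspace.

    Each tool call is assumed to be a dict: {"tool": str, "args": dict}.
    """
    workspace = posixpath.normpath(workspace)

    # Two calls are treated as redundant when the tool and arguments match exactly.
    keys = [
        (call["tool"], json.dumps(call.get("args", {}), sort_keys=True))
        for call in tool_calls
    ]
    redundant = sum(count - 1 for count in Counter(keys).values())

    outside = []
    for call in tool_calls:
        if call["tool"] != "write_file":   # hypothetical tool name
            continue
        # Resolve the target relative to the workspace, collapsing ".." segments.
        target = posixpath.normpath(
            posixpath.join(workspace, call["args"]["path"])
        )
        if posixpath.commonpath([workspace, target]) != workspace:
            outside.append(target)

    return {
        "total_calls": len(tool_calls),
        "redundant_calls": redundant,
        "writes_outside_workspace": outside,
    }


calls = [
    {"tool": "search", "args": {"q": "quarterly report"}},
    {"tool": "search", "args": {"q": "quarterly report"}},
    {"tool": "write_file", "args": {"path": "out/summary.txt"}},
    {"tool": "write_file", "args": {"path": "../etc/cron.d/job"}},
]
signals = process_signals(calls)
```

Exact-match deduplication is deliberately conservative; near-duplicate calls and purely string-based path checks (which ignore symlinks) are cases where the judge agent's own reading of the trajectory still matters.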