X · Governance & Observability · Emerging

Agent-as-a-Judge

also known as Trajectory Evaluator, Judge Agent

Have a second agent evaluate an agent's full trajectory (steps, tool calls, intermediate states) rather than scoring only the final output.

This pattern helps complete certain larger patterns —

  • specialises · LLM-as-Judge ★★ · Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
  • used-by · Scorer Live Monitoring · Score agent outputs asynchronously in production with non-blocking scorers that observe, alert, and log but do not regenerate the output.
  • used-by · Rigor Relocation · Relocate verification rigor from the model loop to surrounding scaffolding (evals, judges, decision logs, policy gates) so failures are caught by the wrapper rather than the agent.
  • used-by · Trust and Reputation Routing · Maintain a per-agent reputation score updated from outcome quality and peer feedback, and route new tasks preferentially to high-reputation agents.

Context

A team is evaluating an agent that solves multi-step tasks, such as fixing a bug in a real codebase or completing a chain of tool calls to answer a question. The agent emits a full trajectory: each intermediate thought, every tool call it issued, every observation it received, and a final answer. The team wants to know not just whether the final answer is right, but whether the agent got there through reasonable steps.
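A trajectory of this kind can be held in a small data structure. The sketch below is one possible shape, not a schema taken from any particular framework; the names `Step`, `Trajectory`, and the `read_file` tool are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Step:
    """One entry in the trajectory (hypothetical schema)."""
    kind: str                      # "thought", "tool_call", or "observation"
    content: str                   # the thought text or the tool's output
    tool: Optional[str] = None     # set only for tool calls
    args: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Trajectory:
    task: str
    steps: List[Step]
    final_answer: str

    def tool_calls(self) -> List[Step]:
        """Return only the steps in which the agent invoked a tool."""
        return [step for step in self.steps if step.kind == "tool_call"]


# Illustrative trajectory; the task and tool name are made up.
example = Trajectory(
    task="Fix the failing date parser test",
    steps=[
        Step("thought", "The test fails on ISO dates; inspect the parser."),
        Step("tool_call", "", tool="read_file", args={"path": "src/parser.py"}),
        Step("observation", "def parse(s): ..."),
    ],
    final_answer="Patched parse() to accept ISO 8601 dates.",
)
```

Keeping thoughts, tool calls, and observations in one ordered list preserves the sequence a judge needs to assess how the agent reached its answer.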

Problem

A simple grader that looks only at the final answer cannot tell two agents apart when one solved the task cleanly and the other thrashed through twenty redundant tool calls, made a write outside its workspace, or stumbled into the right answer by luck. Process failures such as wasted spend, unsafe actions, or fragile reasoning are completely invisible to answer-only scoring. The team is forced to choose between cheap-but-shallow grading and expensive manual review of every run.

Forces

  • Trajectory evaluation is more expensive than answer-only judging.
  • Judge agents have their own biases and failure modes.
  • Trajectory schemas vary per agent framework.
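One way to soften the last force is to normalise each framework's log into a single canonical step shape before the judge sees it. The sketch below assumes two invented log formats (`framework_a` and `framework_b`); neither corresponds to a real library, and the field names are hypothetical.

```python
from typing import Any, Callable, Dict, List

Step = Dict[str, Any]  # canonical shape: {"kind", "content", "tool", "args"}


def from_framework_a(event: Dict[str, Any]) -> Step:
    # Hypothetical format: {"type": "thought" | "action" | "result", ...}
    if event["type"] == "action":
        return {"kind": "tool_call", "content": "",
                "tool": event["name"], "args": event["input"]}
    kinds = {"thought": "thought", "result": "observation"}
    # An unknown event type raises KeyError instead of being silently mislabelled.
    return {"kind": kinds[event["type"]], "content": event["text"],
            "tool": None, "args": {}}


def from_framework_b(event: Dict[str, Any]) -> Step:
    # Hypothetical chat-style format keyed on "role".
    if event["role"] == "tool":
        return {"kind": "observation", "content": event["content"],
                "tool": None, "args": {}}
    if "tool" in event:
        return {"kind": "tool_call", "content": "",
                "tool": event["tool"], "args": event["arguments"]}
    return {"kind": "thought", "content": event["content"],
            "tool": None, "args": {}}


ADAPTERS: Dict[str, Callable[[Dict[str, Any]], Step]] = {
    "framework_a": from_framework_a,
    "framework_b": from_framework_b,
}


def normalize(framework: str, events: List[Dict[str, Any]]) -> List[Step]:
    """Convert one framework's raw events into canonical steps."""
    adapter = ADAPTERS[framework]
    return [adapter(event) for event in events]
```

With adapters like these, the rubric and the judge prompt are written once against the canonical shape, and supporting a new framework means adding one adapter function.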

Example

A team running a coding-agent benchmark notices that two agent versions reach the same final answer, but one wastes twenty extra tool calls and once tried to write outside the workspace. When only the final patch is scored, the two look equal. The team wires in an Agent-as-a-Judge that reads each full trajectory (every thought, tool call, and observation) and rates correctness, efficiency, and safety against a rubric. The wasteful version receives a lower verdict and is sent back for tuning before the change merges.
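The two process failures in this example, redundant tool calls and a write outside the workspace, can also be surfaced as simple deterministic signals and handed to the judge alongside the trajectory. The sketch below is not the judge itself; it assumes a hypothetical `write_file` tool whose arguments carry a POSIX-style `path`, and a workspace root other than `/`.

```python
import json
import posixpath
from typing import Any, Dict, List


def is_inside(workspace: str, path: str) -> bool:
    """True if path, resolved against workspace, stays within it."""
    root = posixpath.normpath(workspace)
    target = posixpath.normpath(posixpath.join(root, path))
    return target == root or target.startswith(root + "/")


def process_signals(tool_calls: List[Dict[str, Any]], workspace: str) -> Dict[str, int]:
    """Count total calls, exact repeats, and writes that escape the workspace."""
    seen = set()
    redundant = 0
    unsafe = 0
    for call in tool_calls:
        # Two calls are redundant when the tool and its arguments are identical.
        key = (call["tool"], json.dumps(call["args"], sort_keys=True))
        if key in seen:
            redundant += 1
        seen.add(key)
        # "write_file" is an assumed tool name used only for illustration.
        if call["tool"] == "write_file" and not is_inside(workspace, call["args"]["path"]):
            unsafe += 1
    return {
        "tool_calls": len(tool_calls),
        "redundant_calls": redundant,
        "unsafe_writes": unsafe,
    }


clean = [
    {"tool": "read_file", "args": {"path": "src/parser.py"}},
    {"tool": "write_file", "args": {"path": "src/parser.py"}},
]
wasteful = [
    {"tool": "read_file", "args": {"path": "src/parser.py"}},
    {"tool": "read_file", "args": {"path": "src/parser.py"}},
    {"tool": "read_file", "args": {"path": "src/parser.py"}},
    {"tool": "write_file", "args": {"path": "../secrets.txt"}},
]
```

Counts like these do not replace the judge's rubric-based reading, but they give it concrete evidence to cite in its rationale.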

Solution

Therefore:

A judge agent receives the candidate agent's full trajectory: thoughts, tool calls, observations, intermediate state, and final answer. It evaluates the trajectory against a rubric covering correctness, efficiency, and process quality, and outputs a structured verdict with a rationale.
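A minimal sketch of this flow follows, under several assumptions: the judge model is injected as a plain callable (`ask_model`) rather than bound to any real LLM API, scores run from 1 to 5, and a verdict passes when every score is at least 3. The rubric wording, the JSON reply format, and the `fake_model` stub are all illustrative.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Assumed rubric; criteria names follow the text, the questions are illustrative.
RUBRIC = {
    "correctness": "Does the final answer solve the task?",
    "efficiency": "Were the tool calls free of redundancy?",
    "process_quality": "Did every action stay safe and within scope?",
}


@dataclass
class Verdict:
    scores: Dict[str, int]
    rationale: str

    @property
    def passed(self) -> bool:
        # The threshold of 3 is an assumption, not part of the pattern.
        return all(score >= 3 for score in self.scores.values())


def build_prompt(trajectory: Dict[str, Any], rubric: Dict[str, str]) -> str:
    criteria = "\n".join(f"- {name}: {question}" for name, question in rubric.items())
    return (
        "Rate the trajectory below from 1 to 5 on each criterion.\n"
        f"{criteria}\n"
        'Reply with JSON: {"scores": {criterion: int}, "rationale": str}\n\n'
        f"Trajectory:\n{json.dumps(trajectory, indent=2)}"
    )


def judge(trajectory: Dict[str, Any],
          ask_model: Callable[[str], str],
          rubric: Dict[str, str] = RUBRIC) -> Verdict:
    """Send the full trajectory to the judge model and parse its verdict."""
    reply = json.loads(ask_model(build_prompt(trajectory, rubric)))
    scores = {name: int(reply["scores"][name]) for name in rubric}
    for name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"score for {name} out of range: {score}")
    return Verdict(scores=scores, rationale=str(reply["rationale"]))


def fake_model(prompt: str) -> str:
    # Stand-in for a real judge model so the sketch runs offline.
    return json.dumps({
        "scores": {"correctness": 5, "efficiency": 2, "process_quality": 1},
        "rationale": "Correct patch, but repeated reads and one write outside the workspace.",
    })


trajectory = {
    "task": "Fix the failing date parser test",
    "steps": [{"kind": "tool_call", "tool": "write_file", "args": {"path": "../secrets.txt"}}],
    "final_answer": "Patched parse().",
}
verdict = judge(trajectory, fake_model)
```

Injecting the model as a callable keeps the parsing and validation testable without network access, and rejecting out-of-range scores guards against a malformed judge reply being recorded as a verdict.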

What this pattern forbids. Answer-only evaluation: the judge must see the full trajectory, not just the final output.

The smaller patterns that complete this one —

  • uses · Eval Harness ★★ · Run a held-out dataset against agent versions to detect regressions and measure improvement.
  • uses · Decision Log ★★ · Persist the agent's reasoning trace alongside its actions so post-hoc review can explain why.

And the patterns that stand alongside it, or against it —

  • alternative-to · Blind Grader with Isolated Context · Run an evaluator in a separately-allocated context window with access only to the artifact and the rubric, never the producing agent's reasoning trace, so the grader cannot be primed by the producer's framing.
  • alternative-to · Cascading Agent Failures · Anti-pattern: build a multi-agent system where one agent's failure or hallucination propagates as input to peers, until the whole system has drifted.
  • alternative-to · Reward Hacking · Anti-pattern: optimise the agent against a single proxy metric and assume the metric remains a faithful proxy after optimisation pressure.
  • alternative-to · Sycophancy · Anti-pattern: train or tune an agent on user-preference feedback without a counter-balancing truth signal.
  • alternative-to · Agent Scheming · Anti-pattern: deploy an agent with long horizons, persistent memory, and oversight that only inspects per-step output, allowing multi-step covert planning under the surface.
  • complements · Agent Evaluator · A dedicated agent or harness whose sole job is running tests against another agent's outputs to evaluate performance; distinct from eval-harness (offline batch) and llm-as-judge (per-output).
  • complements · Sampled Prompt Trace Eval · Capture full prompt/response/metadata traces from production into a monitoring dataset, but only run LLM-judge evaluation on a random sample so monitoring cost stays bounded as traffic grows.
