X · Governance & Observability · Emerging

Agent Evaluator

also known as Agent-Performance Testing Harness, Dedicated Agent-Test Agent

A dedicated agent or harness whose sole job is to run tests against another agent's outputs and evaluate its performance. It is distinct from eval-harness (offline batch) and llm-as-judge (per-output).

Context

A team has an agent in production. Quality is measured through final-output evals and ad-hoc sampling. There is no standing component whose role is *to test the agent*: testing happens during development and stops once the agent ships.

Problem

Without a dedicated agent-evaluator role, agent quality measurement is human-driven and bursty. The agent-evaluator pattern names a standing component for this job: an agent (possibly scripted automation, possibly LLM-driven) that tests the production agent on an ongoing basis. It differs from eval-harness (offline batch) in being an active, ongoing tester, and from llm-as-judge in working at the agent level rather than the output level.

Forces

  • The agent-evaluator is another agent to operate, which means more infrastructure.
  • Designing meaningful agent-evaluator tests requires domain knowledge.
  • Tests can become rituals if not maintained.

Example

A customer-service agent has a sibling agent-evaluator. Every hour the evaluator generates 50 test inputs (10 from the curated suite, 30 variations of production traffic, 10 synthetic edge cases), submits them to the production agent, judges the outputs with LLM-as-judge plus deterministic checks, and posts metrics. A dashboard shows the pass rate over time; a 4% drop triggers a PagerDuty alert.
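The batch mix and the alert rule in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `BATCH_MIX`, `build_batch`, and `should_page` are hypothetical, the three input pools are assumed to already exist as lists of strings, and the "4% drop" is read as four percentage points below a baseline pass rate (the example does not say whether the drop is absolute or relative).

```python
import random
from typing import Sequence

# Mix taken from the example: 10 curated, 30 production variations, 10 synthetic.
BATCH_MIX = {"curated": 10, "production_variation": 30, "synthetic_edge": 10}

# Assumption: "4% drop" means 4 percentage points below the baseline pass rate.
ALERT_DROP_POINTS = 4.0


def build_batch(
    pools: dict[str, Sequence[str]], rng: random.Random
) -> list[tuple[str, str]]:
    """Sample one hourly batch; each item is (source, test_input)."""
    batch: list[tuple[str, str]] = []
    for source, count in BATCH_MIX.items():
        pool = pools[source]
        if len(pool) < count:
            raise ValueError(f"pool {source!r} has fewer than {count} inputs")
        # Sample without replacement so one batch never repeats an input.
        batch.extend((source, item) for item in rng.sample(pool, count))
    return batch


def should_page(baseline_pass_rate: float, current_pass_rate: float) -> bool:
    """Return True when the pass rate (in percent, 0-100) fell far enough to alert."""
    return baseline_pass_rate - current_pass_rate >= ALERT_DROP_POINTS
```

Keeping the mix in one table makes it easy to review how much of each batch comes from fixed cases versus fresh traffic. How the baseline is chosen (previous hour, trailing average, last release) is left open here, as it is in the example.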

Solution

Therefore:

The agent-evaluator runs continuously or on a cadence. It generates test inputs from (a) a curated suite, (b) variations of production traffic, and (c) synthetic edge cases; submits them to the production agent; judges the outputs (LLM-as-judge or deterministic check); and reports pass-rate metrics over time. Pair it with eval-harness, llm-as-judge, dual-evaluation-offline-online, and artifact-evaluation.
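One evaluation cycle, as described above, might look like the following Python sketch. Everything named here is hypothetical: `run_cycle`, `CycleReport`, and the stub functions are illustrative, the production agent is assumed to be reachable through a plain callable that maps an input string to an output string, and each judge is assumed to be a callable returning pass or fail. An LLM-as-judge call would be wrapped in one of those callables; none is shown here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# A judge receives (test_input, agent_output) and returns True on pass.
Judge = Callable[[str, str], bool]


@dataclass
class CycleReport:
    total: int
    passed: int

    @property
    def pass_rate(self) -> float:
        """Fraction of test inputs that passed every judge (0.0 if none ran)."""
        return self.passed / self.total if self.total else 0.0


def run_cycle(
    test_inputs: Iterable[str],
    call_agent: Callable[[str], str],
    judges: Iterable[Judge],
) -> CycleReport:
    """Submit each input to the agent and count outputs that pass all judges."""
    judge_list = list(judges)
    total = passed = 0
    for test_input in test_inputs:
        output = call_agent(test_input)
        total += 1
        if all(judge(test_input, output) for judge in judge_list):
            passed += 1
    return CycleReport(total=total, passed=passed)


# Stand-ins for illustration only; a real setup would call the production agent.
def fake_agent(text: str) -> str:
    return text.upper()


def non_empty(_test_input: str, output: str) -> bool:
    return bool(output.strip())


report = run_cycle(["hi", "refund please", " "], fake_agent, [non_empty])
```

A scheduler (cron, a workflow engine, or a simple loop) would call `run_cycle` on the chosen cadence and push `report.pass_rate` to whatever metrics store backs the dashboard.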

What this pattern forbids. Treating the agent-evaluator as an ad-hoc tool. It is a standing component: tests run on a cadence and results are dashboarded.

And the patterns that stand alongside it, or against it —

  • complements · Eval Harness ★★ · Run a held-out dataset against agent versions to detect regressions and measure improvement.
  • complements · LLM-as-Judge ★★ · Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
  • complements · Dual Evaluation (Offline + Online) · Run two parallel evaluation tracks (offline benchmark gates before deploy and online production-traffic monitoring after) so drift is caught even when pre-deploy benchmarks pass.
  • complements · Intermediate Artifact Evaluation · Evaluate intermediate artifacts (plans, tool-call traces, guardrail reactions), not only final outputs; this isolates a failure to a specific pipeline node.
  • complements · Agent-as-a-Judge · Have another agent evaluate an agent's full trajectory (steps, tool calls, intermediate states) rather than scoring only the final output.
  • complements · Decision Context Maps · Before any consequential decision, require the agent to gather a declared set of contextual inputs (resource availability, schedules, downstream dependencies) into a 'context map' the decision must cite.
