Agent Evaluator
also known as Agent-Performance Testing Harness, Dedicated Agent-Test Agent
A dedicated agent or harness whose sole job is running tests against another agent's outputs to evaluate performance; distinct from eval-harness (offline batch) and llm-as-judge (per-output).
Context
A team has an agent in production. Quality is measured via final-output eval and ad-hoc sampling. There is no standing component whose role is *to test the agent* — testing happens during development and stops once shipped.
Problem
Without a dedicated agent-evaluator role, agent quality measurement is human-driven and bursty. The agent-evaluator pattern makes that role a standing component: an agent (possibly automated, possibly LLM-driven) whose job is to test the production agent on an ongoing basis. It differs from eval-harness (offline batch) by being an active, ongoing tester, and from llm-as-judge by evaluating at the agent level rather than the output level.
Forces
- Agent-evaluator is another agent to operate — more infrastructure.
- Designing meaningful agent-evaluator tests requires domain knowledge.
- Tests can become rituals if not maintained.
Example
A customer-service agent has a sibling agent-evaluator. Every hour the evaluator generates 50 test inputs (10 from the curated suite, 30 from production-traffic variations, 10 synthetic edge cases), submits them to the production agent, judges the outputs with LLM-as-judge plus deterministic checks, and posts metrics. A dashboard shows pass rate over time; a 4% drop triggers a PagerDuty alert.
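The batch mix and the alert rule in this example could be sketched as follows. This is a minimal illustration, not the pattern's reference implementation: `BATCH_MIX`, `build_batch`, and `should_page` are hypothetical names, the input pools are plain lists of strings, and the "4% drop" is assumed to mean four percentage points below a baseline pass rate (the example does not say whether the drop is absolute or relative).

```python
import random
from typing import Sequence

# Hypothetical mix taken from the example: 10 curated, 30 traffic variations, 10 synthetic.
BATCH_MIX = {"curated": 10, "traffic_variation": 30, "synthetic_edge": 10}


def build_batch(sources: dict[str, Sequence[str]], rng: random.Random) -> list[tuple[str, str]]:
    """Sample the configured number of inputs from each source, tagged by origin."""
    batch = []
    for name, count in BATCH_MIX.items():
        pool = list(sources[name])
        # Never ask for more items than the pool holds.
        picks = rng.sample(pool, k=min(count, len(pool)))
        batch.extend((name, text) for text in picks)
    return batch


def should_page(baseline_pass_rate: float, current_pass_rate: float, max_drop: float = 0.04) -> bool:
    """Alert when the pass rate falls by max_drop or more.

    Assumption: the drop is measured in absolute percentage points.
    Rounding avoids float noise right at the threshold.
    """
    return round(baseline_pass_rate - current_pass_rate, 6) >= max_drop
```

Tagging each input with its origin lets the dashboard break the pass rate down by source, which helps tell a regression on curated cases apart from noise in synthetic ones.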
Solution
Therefore:
The agent-evaluator runs continuously or on a cadence. It generates test inputs from (a) a curated suite, (b) variations of production traffic, and (c) synthetic edge cases; submits them to the production agent; judges the outputs (LLM-as-judge or deterministic check); and reports pass-rate metrics over time. Pair it with eval-harness, llm-as-judge, dual-evaluation-offline-online, and artifact-evaluation.
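One evaluation cycle (submit, judge, report) might look like the sketch below. All names here (`CycleReport`, `run_cycle`, the `agent` and `judges` callables) are hypothetical; the production agent is modelled as a plain function from input text to output text, and each judge as a function returning pass or fail, which could wrap either a deterministic check or an LLM-as-judge call.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class CycleReport:
    total: int
    passed: int

    @property
    def pass_rate(self) -> float:
        # An empty cycle reports 0.0 rather than dividing by zero.
        return self.passed / self.total if self.total else 0.0


def run_cycle(
    test_inputs: Iterable[str],
    agent: Callable[[str], str],
    judges: Iterable[Callable[[str, str], bool]],
) -> CycleReport:
    """Submit each input to the agent and count outputs that every judge accepts."""
    judge_list = list(judges)
    total = passed = 0
    for text in test_inputs:
        output = agent(text)
        total += 1
        if all(judge(text, output) for judge in judge_list):
            passed += 1
    return CycleReport(total=total, passed=passed)


if __name__ == "__main__":
    # Stand-in agent and judges, for illustration only.
    report = run_cycle(
        ["hi", "hello", "yo"],
        agent=lambda s: s.upper(),
        judges=[lambda i, o: o == i.upper(), lambda i, o: len(o) < 5],
    )
    print(f"{report.passed}/{report.total} passed ({report.pass_rate:.0%})")
```

A scheduler (cron, a workflow engine, or a loop with a sleep) would call `run_cycle` on the chosen cadence and push each `CycleReport` to the metrics store behind the dashboard.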
What this pattern forbids. Treating the agent-evaluator as an ad-hoc tool: it is a standing component, its tests run on a cadence, and its results are dashboarded.
And the patterns that stand alongside it, or against it —
- complements Eval Harness ★★: Run a held-out dataset against agent versions to detect regressions and measure improvement.
- complements LLM-as-Judge ★★: Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
- complements Dual Evaluation (Offline + Online) ★: Run two parallel evaluation tracks (offline benchmark gates before deploy AND online production-traffic monitoring after) so drift is caught even when pre-deploy benchmarks pass.
- complements Intermediate Artifact Evaluation ★: Evaluate intermediate artifacts (plans, tool-call traces, guardrail reactions), not only final outputs; this isolates failure to a specific pipeline node.
- complements Agent-as-a-Judge ★: Evaluate an agent's full trajectory (steps, tool calls, intermediate states) by another agent rather than scoring only the final output.
- complements Decision Context Maps ★: Before any consequential decision, require the agent to gather a declared set of contextual inputs (resource availability, schedules, downstream dependencies) into a 'context map' the decision must cite.