Agent Evaluator

also known as Agent-Performance Testing Harness, Dedicated Agent-Test Agent

A dedicated agent or harness whose sole job is running tests against another agent's outputs to evaluate performance; distinct from eval-harness (offline batch) and llm-as-judge (per-output).

Context

A team has an agent in production. Quality is measured via final-output eval and ad-hoc sampling. There is no standing component whose role is *to test the agent* — testing happens during development and stops once shipped.

Problem

Without a dedicated agent-evaluator role, agent quality measurement is human-driven and bursty. The agent-evaluator pattern names a standing component to fill that gap: an agent (possibly plain automation, possibly LLM-driven) whose job is to test the production agent on an ongoing basis. It differs from eval-harness (offline batch) by being an active, ongoing tester, and from llm-as-judge by operating at the agent level rather than the output level.

Forces

Agent-evaluator is another agent to operate — more infrastructure.
Designing meaningful agent-evaluator tests requires domain knowledge.
Tests can become rituals if not maintained.

Example

A customer-service agent has a sibling agent-evaluator. Every hour the evaluator generates 50 test inputs (10 from the curated suite, 30 from production-traffic variations, 10 synthetic edge cases), submits them to the production agent, judges the outputs with LLM-as-judge plus deterministic checks, and posts metrics. A dashboard shows pass rate over time; a 4% drop triggers a PagerDuty alert.
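The batch mix and the alert rule in this example can be sketched as follows. This is a minimal illustration, not the catalog's implementation: the names `BATCH_MIX`, `build_batch`, and `should_alert` are hypothetical, and the sketch assumes the 4% drop is measured in absolute percentage points against a baseline pass rate (the example does not say whether the drop is absolute or relative).

```python
import random
from typing import Sequence

# Mix taken from the example: 10 curated, 30 production variations, 10 synthetic.
BATCH_MIX = {"curated": 10, "production_variation": 30, "synthetic_edge": 10}


def build_batch(
    sources: dict[str, Sequence[str]], rng: random.Random
) -> list[tuple[str, str]]:
    """Sample (source_name, test_input) pairs according to BATCH_MIX."""
    batch: list[tuple[str, str]] = []
    for name, count in BATCH_MIX.items():
        pool = list(sources[name])
        if len(pool) < count:
            raise ValueError(f"source {name!r} has {len(pool)} inputs, need {count}")
        # Sample without replacement so one cycle never repeats an input.
        batch.extend((name, item) for item in rng.sample(pool, count))
    return batch


def should_alert(
    baseline_pass_rate: float, current_pass_rate: float, max_drop: float = 0.04
) -> bool:
    """Assumption: the drop is absolute (0.95 -> 0.91 counts as a 4-point drop)."""
    # Round to avoid floating-point noise right at the threshold.
    return round(baseline_pass_rate - current_pass_rate, 6) >= max_drop
```

Keeping the source name next to each input lets the dashboard break the pass rate down by source, which helps tell a regression on the curated suite apart from noise in the synthetic cases.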

Diagram

flowchart TD
    Cad[Cadence trigger] --> Eval[Agent Evaluator]
    Eval --> Gen[Generate test inputs]
    Gen --> Prod[Production agent]
    Prod --> Out[Outputs]
    Out --> Judge[Judge outputs]
    Judge --> Dash[Dashboard + alerts]

Solution

Therefore:

The agent-evaluator runs continuously or on a cadence. It generates test inputs from (a) a curated suite, (b) variations of production traffic, and (c) synthetic edge cases; submits them to the production agent; judges the outputs (LLM-as-judge or deterministic check); and reports pass-rate metrics over time. Pair it with eval-harness, llm-as-judge, dual-evaluation-offline-online, and artifact-evaluation.
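One evaluation cycle (submit, judge, report) might look like the sketch below. Everything here is an assumption made for illustration: `run_cycle`, `CycleReport`, `echo_agent`, and `non_empty` are hypothetical names, the production agent is modelled as a plain callable, and a test case counts as passed only when every check passes. An LLM-as-judge would plug in as one more check function with the same signature.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# A check receives (test_input, agent_output) and returns True on pass.
Check = Callable[[str, str], bool]


@dataclass
class CycleReport:
    total: int
    passed: int

    @property
    def pass_rate(self) -> float:
        # Report 0.0 for an empty cycle instead of dividing by zero.
        return self.passed / self.total if self.total else 0.0


def run_cycle(
    test_inputs: Iterable[str],
    call_agent: Callable[[str], str],
    checks: list[Check],
) -> CycleReport:
    """Submit each input to the agent, judge the output, and tally results."""
    total = passed = 0
    for test_input in test_inputs:
        output = call_agent(test_input)
        total += 1
        if all(check(test_input, output) for check in checks):
            passed += 1
    return CycleReport(total=total, passed=passed)


def echo_agent(prompt: str) -> str:
    """Hypothetical stand-in for the production agent."""
    return prompt.strip()


def non_empty(test_input: str, output: str) -> bool:
    """Hypothetical deterministic check: the agent must return some text."""
    return bool(output)
```

A scheduler (cron, a workflow engine, or a simple loop) would call `run_cycle` on the chosen cadence and push `pass_rate` to the dashboard; that wiring is left out because the pattern does not prescribe it.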

What this pattern forbids. Treating the agent-evaluator as an ad-hoc tool: it is a standing component, its tests run on a cadence, and its results are dashboarded.

And the patterns that stand alongside it, or against it —

complements Eval Harness ★★: Run a held-out dataset against agent versions to detect regressions and measure improvement.
complements LLM-as-Judge ★★: Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
complements Dual Evaluation (Offline + Online) ★: Run two parallel evaluation tracks (offline benchmark gates before deploy and online production-traffic monitoring after) so drift is caught even when pre-deploy benchmarks pass.
complements Intermediate Artifact Evaluation ★: Evaluate intermediate artifacts (plans, tool-call traces, guardrail reactions), not only final outputs; this isolates failure to a specific pipeline node.
complements Agent-as-a-Judge ★: Evaluate an agent's full trajectory (steps, tool calls, intermediate states) by another agent rather than scoring only the final output.
complements Decision Context Maps ★: Before any consequential decision, require the agent to gather a declared set of contextual inputs (resource availability, schedules, downstream dependencies) into a 'context map' the decision must cite.

Used in frameworks

DeepEval
core · 5 patterns · Enterprise Platforms ★★ · mature
DeepEval is a dedicated testing harness whose sole job is running metric-based test cases against another LLM app's outputs, integrating with pytest so agent outputs are unit-test…

References

[Paper introduction] 18 Design Patterns for LLM-Based AI Agents
blog

Provenance

Source: patterns/agent-evaluator.md on GitHub · commit 0f962e5
Added to catalog: 2026-05-23
Last updated: 2026-05-23
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.