Agent Evaluator
A dedicated agent or harness whose sole job is running tests against another agent's outputs to evaluate performance; distinct from eval-harness (offline batch) and llm-as-judge (per-output).
Problem
Without a dedicated agent-evaluator role, agent quality measurement is human-driven and bursty. The agent-evaluator pattern names a standing component for this work: an agent (possibly automated, possibly LLM-driven) whose job is to test the production agent on an ongoing basis. It differs from eval-harness (offline batch) by being an active, ongoing tester, and from llm-as-judge by operating at the agent level rather than the output level.
Solution
The agent-evaluator runs continuously or on a cadence. It generates test inputs from (a) a curated suite, (b) variations of production traffic, and (c) synthetic edge cases. It submits them to the production agent, judges the outputs (LLM-as-judge or deterministic check), and reports pass-rate metrics over time. Pair it with eval-harness, llm-as-judge, dual-evaluation-offline-online, and artifact-evaluation.
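One evaluation round of the loop described above might be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `TestCase`, `RunReport`, `evaluate_once`, and `toy_agent` are hypothetical names, the production agent is modelled as a plain function from prompt to output, and only deterministic checks are shown (an LLM judge could be substituted for `check`).

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class TestCase:
    prompt: str
    check: Callable[[str], bool]  # deterministic check; an LLM judge could stand in here
    source: str                   # "curated", "traffic-variant", or "synthetic"


@dataclass
class RunReport:
    total: int
    passed: int

    @property
    def pass_rate(self) -> float:
        # Guard against an empty suite.
        return self.passed / self.total if self.total else 0.0


def evaluate_once(agent: Callable[[str], str], cases: Iterable[TestCase]) -> RunReport:
    """Submit each case to the agent, judge the output, and tally the results."""
    total = 0
    passed = 0
    for case in cases:
        total += 1
        try:
            ok = case.check(agent(case.prompt))
        except Exception:
            ok = False  # an agent or check crash counts as a failure
        passed += int(ok)
    return RunReport(total=total, passed=passed)


# Hypothetical stand-in for the production agent.
def toy_agent(prompt: str) -> str:
    return prompt.upper()


cases = [
    TestCase("hello", lambda out: out == "HELLO", "curated"),
    TestCase("hello world", lambda out: "WORLD" in out, "traffic-variant"),
    TestCase("", lambda out: out != "", "synthetic"),  # edge case: empty input
]

report = evaluate_once(toy_agent, cases)
print(f"{report.passed}/{report.total} passed ({report.pass_rate:.0%})")
```

In a standing deployment, a scheduler would call `evaluate_once` on a cadence and append each `RunReport` to a time series, so that a drop in pass rate shows up as a trend rather than a one-off observation.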
When to use
- Production agent whose quality must be continuously assured.
- Engineering capacity to operate the evaluator.
- Tests can be generated automatically or curated periodically.