Governance & Observability

Agent Evaluator

A dedicated agent or harness whose sole job is running tests against another agent's outputs to evaluate performance; distinct from eval-harness (offline batch) and llm-as-judge (per-output).

Problem

Without a dedicated agent-evaluator role, agent quality measurement is human-driven and bursty. The agent-evaluator pattern names this as a standing component: an agent (possibly automated, possibly LLM-driven) whose job is to test the production agent on an ongoing basis. Differs from eval-harness (offline batch) by being an active, ongoing tester; from llm-as-judge by being agent-level not output-level.

Solution

Agent-evaluator runs continuously or on a cadence. Generates test inputs from (a) a curated suite, (b) variations of production traffic, (c) synthetic edge cases. Submits to the production agent. Judges outputs (LLM-as-judge or deterministic check). Reports pass-rate metrics over time. Pair with eval-harness, llm-as-judge, dual-evaluation-offline-online, artifact-evaluation.

When to use

Production agent whose quality must be continuously assured.
Engineering capacity to operate the evaluator.
Tests can be generated automatically or curated periodically.

Open the full interactive page →

Diagram, neighbourhood map, code examples, related patterns and full provenance.

Problem

Solution

When to use

Related