II · Planning & Control Flow · Experimental

Planner-Generator-Evaluator Harness

also known as Three-Agent Harness, GAN-Inspired Agent Architecture, Spec-Plan-Generate-Evaluate Loop

Decompose a long-running job into three role-isolated agents — a Planner emitting a feature list, a Generator working one chunk per fresh context, and an Evaluator grading against a rubric without seeing the Generator's trace.

This pattern helps complete certain larger patterns —

specialises Evaluator-Optimizer ★★: One LLM generates; another evaluates and feeds back; loop until criteria are met.
specialises Orchestrator-Workers ★★: An orchestrator dynamically breaks a task into subtasks at runtime and delegates each to a worker LLM, then synthesises results.

Context

A team runs a coding-agent harness on multi-day creative work — building a new feature across a large application, conducting a large refactor, drafting a long design document. The job is too big to fit into a single model context window, so it has to be split across many runs. There is a clear external artefact (code, document, design) that can be evaluated on its own merits without inspecting how it was produced.

Problem

A single agent trying to do all of this in one head hits context limits within a few hours and conflates planning, generation, and self-grading; its own scratch reasoning leaks into how it judges its work. A two-role loop where one agent generates and the other critiques lets the generator read the critic's notes as hints and game them. Generic orchestrator-worker decomposition does not name a grader role with hard isolation, so quality drifts run by run and there is no fixed place to enforce the acceptance bar. The team needs a three-way split where each role's context stays small, the grader cannot be socially engineered by the generator, and the plan survives across runs.

Forces

Each role's context must stay small enough to fit, yet the overall job spans days.
The evaluator must judge the artefact, not the process, but the generator naturally wants to argue.
Plans must be machine-checkable so the generator can pick up the next chunk without re-reading the user's prompt.
Role isolation costs orchestration complexity and inter-role hand-off latency.

Example

A coding agent is asked to add OAuth support across a large web app. The Planner reads the prompt and writes feature-list.json: ten ordered chunks with acceptance criteria. The Generator boots a fresh context per chunk, edits files, exits. The Evaluator boots its own fresh context, reads only the diff and the rubric ("does it compile, do the new tests pass, are there no plaintext secrets"), and returns findings. Chunk 4 fails; the driver re-invokes the Generator with the findings but not the Evaluator's reasoning trace. Across two days the artefact converges without any one context exceeding its limit.
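As a rough illustration, the feature-list.json from this example might be shaped like the sketch below. The field names (`id`, `title`, `depends_on`, `acceptance`, `status`) and the helpers `validate_plan` and `next_chunk` are assumptions made for illustration; the pattern itself only asks for ordered chunks with acceptance criteria. The point is that a plan in this shape can be checked and consumed by code, so the Generator can pick up the next chunk without re-reading the user's prompt.

```python
import json

# Hypothetical feature-list.json content; the field names are illustrative assumptions.
FEATURE_LIST_JSON = """
{
  "chunks": [
    {"id": 1, "title": "Add OAuth client config", "depends_on": [],
     "acceptance": ["compiles", "new tests pass"], "status": "done"},
    {"id": 2, "title": "Implement login redirect", "depends_on": [1],
     "acceptance": ["compiles", "new tests pass"], "status": "pending"},
    {"id": 3, "title": "Store tokens securely", "depends_on": [2],
     "acceptance": ["no plaintext secrets"], "status": "pending"}
  ]
}
"""

REQUIRED_FIELDS = {"id", "title", "depends_on", "acceptance", "status"}


def validate_plan(plan):
    """Raise ValueError if the plan is not machine-checkable."""
    seen = set()
    for chunk in plan["chunks"]:
        missing = REQUIRED_FIELDS - set(chunk)
        if missing:
            raise ValueError(f"chunk {chunk.get('id')} missing fields: {sorted(missing)}")
        if not chunk["acceptance"]:
            raise ValueError(f"chunk {chunk['id']} has no acceptance criteria")
        # Ordered plan: a chunk may only depend on chunks listed before it.
        for dep in chunk["depends_on"]:
            if dep not in seen:
                raise ValueError(f"chunk {chunk['id']} depends on unknown or later chunk {dep}")
        seen.add(chunk["id"])


def next_chunk(plan):
    """Return the first pending chunk whose dependencies are all done, or None."""
    done = {c["id"] for c in plan["chunks"] if c["status"] == "done"}
    for chunk in plan["chunks"]:
        if chunk["status"] == "pending" and set(chunk["depends_on"]) <= done:
            return chunk
    return None


plan = json.loads(FEATURE_LIST_JSON)
validate_plan(plan)
chunk = next_chunk(plan)
print(chunk["id"], chunk["title"])  # prints: 2 Implement login redirect
```

Chunk 3 is skipped here because its dependency (chunk 2) is not yet done, which is how the ordering survives across separate runs.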

Diagram

flowchart TD
    U[User prompt] --> PL[Planner<br/>runs once]
    PL --> FL[(feature-list.json:<br/>ordered chunks + acceptance criteria)]
    DR[Driver loop] --> GEN[Generator<br/>fresh context per chunk]
    FL --> GEN
    ART[(Artefact state)] --> GEN
    GEN --> ART2[Artefact']
    ART2 --> EV[Evaluator<br/>fresh context, rubric only]
    RUB[(Rubric)] --> EV
    EV --> FIND[(findings.json)]
    FIND --> DR
    DR -->|pass| DONE[Done]
    DR -->|fail| GEN

Solution

Therefore:

The Planner runs once (or rarely) and emits a structured feature-list artefact: ordered chunks, acceptance criteria, dependencies. The Generator is invoked per-chunk in a fresh context that includes only (a) the feature-list, (b) the current artefact state, and (c) the chunk to build; it produces a new artefact revision and exits. The Evaluator is invoked in its own fresh context with only the artefact and the fixed rubric; it returns pass/fail plus structured findings, and never sees the Generator's chain of thought or scratch notes. A small driver loop routes between the three: failed evaluation re-invokes the Generator with the findings as input (not the full Evaluator transcript). The fixed rubric makes Evaluator behaviour reproducible across runs.
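The driver loop described above can be sketched in a few lines. Everything named here is hypothetical: `Verdict`, `drive`, the stub Generator and Evaluator, and the `max_attempts` cap are assumptions for illustration, and the stubs stand in for fresh-context model calls. The sketch also assumes the artefact advances only on a passing verdict, which the pattern does not prescribe. What it is meant to show is the routing rule: on failure, only the structured findings go back to the Generator, never the Evaluator's transcript.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    passed: bool
    findings: list = field(default_factory=list)  # structured findings, forwarded on failure
    transcript: str = ""                           # Evaluator reasoning, never forwarded


def drive(feature_list, artefact, generate, evaluate, rubric, max_attempts=3):
    """Route each chunk through Generator and Evaluator until it passes."""
    attempts_log = []
    for chunk in feature_list:
        findings = []
        for attempt in range(1, max_attempts + 1):
            # Generator input: feature list, artefact state, chunk, prior findings.
            candidate = generate(feature_list, artefact, chunk, findings)
            # Evaluator input: the artefact and the rubric, nothing else.
            verdict = evaluate(candidate, rubric)
            attempts_log.append((chunk, attempt, verdict.passed))
            if verdict.passed:
                artefact = candidate  # assumption: advance only on a pass
                break
            findings = verdict.findings  # deliberately not verdict.transcript
        else:
            raise RuntimeError(f"chunk {chunk!r} failed {max_attempts} attempts")
    return artefact, attempts_log


def stub_generate(feature_list, artefact, chunk, findings):
    # Stand-in for a fresh-context model call; leaks a secret until told about it.
    leaked = chunk == "store tokens" and not findings
    return artefact + [chunk + (" [plaintext secret]" if leaked else "")]


def stub_evaluate(artefact, rubric):
    # Stand-in grader: each rubric entry is a forbidden marker (an assumption).
    findings = [f"{rule}: {line}" for line in artefact for rule in rubric if rule in line]
    return Verdict(passed=not findings, findings=findings, transcript="private reasoning")


feature_list = ["add oauth config", "store tokens"]
final, log = drive(feature_list, [], stub_generate, stub_evaluate, rubric=["plaintext secret"])
print(final)
print(log)
```

In this run the second chunk fails once, the Generator is re-invoked with the findings, and the second attempt passes.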

What this pattern forbids. The Evaluator must never receive the Generator's reasoning trace or scratch context, only the artefact and the rubric; the Generator must not re-plan (any plan change goes back to the Planner); the Planner must not generate the artefact directly.
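One way a driver might enforce these prohibitions mechanically is an allow-list per role, checked when each context is assembled. The sketch below is an assumption-laden illustration: the field names, the `build_context` helper, and the choice of `PermissionError` are all hypothetical and not part of the pattern.

```python
# Hypothetical allow-lists: the only fields each role's fresh context may contain.
EVALUATOR_ALLOWED = frozenset({"artefact", "rubric"})
GENERATOR_ALLOWED = frozenset({"feature_list", "artefact", "chunk", "findings"})


def build_context(allowed, **fields):
    """Return a context dict, rejecting any field the role must not see."""
    forbidden = set(fields) - allowed
    if forbidden:
        raise PermissionError(f"fields not allowed for this role: {sorted(forbidden)}")
    return dict(fields)


# Allowed: the Evaluator gets the artefact and the rubric only.
ctx = build_context(EVALUATOR_ALLOWED, artefact="diff text", rubric=["does it compile"])
print(sorted(ctx))

# Rejected: passing the Generator's trace to the Evaluator is refused up front.
try:
    build_context(EVALUATOR_ALLOWED, artefact="diff text",
                  rubric=["does it compile"], generator_trace="scratch notes")
except PermissionError as err:
    print("rejected:", err)
```

Failing loudly at context-assembly time keeps the isolation rule in the driver rather than relying on each role's prompt to honour it.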

The smaller patterns that complete this one —

uses Frozen Rubric Reflection ★: Constrain reflection to a fixed, hand-authored rubric of criteria so the reviewer cannot invent new ones each run.
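A frozen rubric could be approximated in code as an immutable value with a pinned fingerprint, so that the driver notices if the criteria drift between runs. This is a minimal sketch under stated assumptions: the `Rubric` class, the `fingerprint` method, and `check_rubric` are hypothetical names, and the criteria are taken from the example earlier in this pattern.

```python
import hashlib
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Rubric:
    criteria: tuple  # hand-authored, fixed for the whole job

    def fingerprint(self):
        # Stable digest of the criteria text, used to detect drift across runs.
        joined = "\n".join(self.criteria).encode("utf-8")
        return hashlib.sha256(joined).hexdigest()


RUBRIC = Rubric(criteria=(
    "does it compile",
    "do the new tests pass",
    "are there no plaintext secrets",
))
PINNED = RUBRIC.fingerprint()  # assumption: recorded once, e.g. alongside the plan


def check_rubric(rubric, pinned):
    """Raise ValueError if the rubric differs from the pinned one."""
    if rubric.fingerprint() != pinned:
        raise ValueError("rubric changed since it was pinned")


check_rubric(RUBRIC, PINNED)  # passes silently

try:
    RUBRIC.criteria = ("anything goes",)  # in-place edits are refused
except FrozenInstanceError:
    print("rubric is frozen")
```

A driver could call `check_rubric` before each Evaluator invocation so that a silently edited rubric stops the run instead of shifting the acceptance bar.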

And the patterns that stand alongside it, or against it —

alternative-to Planner-Executor-Observer ★: Add an explicit Observer role between Planner and Executor so progress is checked against the plan instead of trusted blindly.
complements Spec-First Agent ★: Drive the agent loop from a human-authored specification document rather than free-form prompts.

Used in frameworks

Claude Agent SDK
core · 18 patterns · Agent SDKs · ★ emerging
A planner expands a short prompt into a full spec, a generator works one feature per sprint against that spec, and a separate evaluator grades each sprint against criteria the gen…

Provenance

Source: patterns/planner-generator-evaluator-harness.md on GitHub · commit 4314cd3
Added to catalog: 2026-05-19
Last updated: 2026-05-21
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.