VII · Verification & Reflection · Mature ★★

Evaluator-Optimizer

also known as Generator-Critic Loop, LLM-as-Judge Refinement

One LLM generates; another evaluates and feeds back; loop until criteria are met.

This pattern helps complete certain larger patterns —

  • used-by CRAG: Add a lightweight retrieval evaluator that grades each retrieved document and triggers corrective web search on poor retrievals.
  • used-by Dynamic Expert Recruitment ·: Generate the agent team (role descriptions and instances) at run time based on the specific task, then adjust team composition between iterations based on evaluation feedback.

Context

A team runs a generation task where the quality of a candidate can be scored against explicit criteria: unit tests pass or fail, a rubric is satisfied or not, a translation matches a glossary or it doesn't. Single-shot generation gets most cases right but plateaus below the quality bar the team needs. The team can afford to spend several model calls per output and is willing to trade latency for quality.

Problem

When generation and evaluation happen in one prompt, the model has no incentive to disagree with itself: it produces a draft and then signs off on it. Single-shot generation tops out below what a loop with an explicit evaluator achieves, but a naive loop where the same prompt does both jobs collapses into self-approval and adds cost without adding quality. The team needs separate roles for proposing and judging, and a bounded loop between them. Without the separation the system fails to improve past one pass; without the bound it runs forever chasing diminishing critique.

Forces

  • The evaluator must be calibrated; a bad judge teaches bad lessons.
  • The loop budget caps cost: every extra iteration adds model calls and latency.
  • Generator and evaluator can collude, especially if both use the same model and the same prompt family.
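
One way to make the calibration force concrete is to compare the evaluator's verdicts with a small set of human-labelled candidates before trusting it inside the loop. The sketch below is an illustration under stated assumptions: `calibration_report`, the `judge` callable, and the labelled examples are all hypothetical, and the toy judge stands in for a real LLM or rubric check.

```python
from typing import Callable, Dict, List, Tuple


def calibration_report(
    judge: Callable[[str], bool],
    labelled: List[Tuple[str, bool]],
) -> Dict[str, float]:
    """Compare judge verdicts with human pass/fail labels."""
    if not labelled:
        raise ValueError("need at least one labelled example")
    agree = false_pass = false_fail = 0
    for candidate, human_pass in labelled:
        verdict = judge(candidate)
        if verdict == human_pass:
            agree += 1
        elif verdict:   # judge passed what a human rejected
            false_pass += 1
        else:           # judge rejected what a human accepted
            false_fail += 1
    n = len(labelled)
    return {
        "agreement": agree / n,
        "false_pass_rate": false_pass / n,
        "false_fail_rate": false_fail / n,
    }


def toy_judge(candidate: str) -> bool:
    # Deliberately bad judge (hypothetical): passes anything long enough.
    return len(candidate) > 10


# Hypothetical human-labelled set: (candidate, human says it passes).
LABELLED = [
    ("short", False),
    ("a long but wrong answer", False),
    ("a long correct answer", True),
    ("ok", True),
]

report = calibration_report(toy_judge, LABELLED)
print(report)
# {'agreement': 0.5, 'false_pass_rate': 0.25, 'false_fail_rate': 0.25}
```

A high false-pass rate is the more damaging failure here, since the loop would accept weak candidates; a high false-fail rate mostly wastes loop budget.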

Example

A code-generation agent produces a function that compiles but fails three of the team's unit tests. Single-shot generation has topped out. The team wraps the generator in an Evaluator-Optimizer loop: a second LLM (or a deterministic test runner) reads the candidate, returns specific failure feedback, and the generator revises against it. The loop runs up to five times or until all tests pass. The average pass rate on the same tasks rises substantially without changing the underlying model.
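
A minimal sketch of the deterministic-test-runner variant of this example follows. Everything named here is an assumption made for illustration: the target function `slugify`, its three test cases, and `canned_generator`, which stands in for the LLM call by returning a naive draft first and a fixed draft once it receives feedback. Running candidate code with `exec` is unsafe for untrusted output, so a real setup would sandbox it.

```python
from typing import Callable, List, Tuple

# Hypothetical unit tests for a function named `slugify`: (input, expected).
TESTS: List[Tuple[str, str]] = [
    ("Hello World", "hello-world"),
    ("  padded  ", "padded"),
    ("A  B", "a-b"),
]


def run_tests(source: str) -> List[str]:
    """Deterministic evaluator: return one message per failing test."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # WARNING: sandbox this for untrusted code
        fn = namespace["slugify"]
    except Exception as exc:
        return [f"candidate did not load: {exc!r}"]
    failures = []
    for arg, expected in TESTS:
        try:
            got = fn(arg)
        except Exception as exc:
            failures.append(f"slugify({arg!r}) raised {exc!r}")
            continue
        if got != expected:
            failures.append(
                f"slugify({arg!r}) returned {got!r}, expected {expected!r}"
            )
    return failures


def refine(generate: Callable[[List[str]], str], max_iterations: int = 5):
    """Loop up to max_iterations times or until every test passes."""
    feedback: List[str] = []
    source = ""
    for attempt in range(1, max_iterations + 1):
        source = generate(feedback)   # generator revises against the feedback
        feedback = run_tests(source)  # evaluator returns specific failures
        if not feedback:
            return source, attempt, True
    return source, max_iterations, False


def canned_generator(feedback: List[str]) -> str:
    # Stand-in for an LLM call (assumption): naive draft, then a fix.
    if not feedback:
        return "def slugify(s):\n    return s.replace(' ', '-')\n"
    return "def slugify(s):\n    return '-'.join(s.lower().split())\n"


source, attempts, passed = refine(canned_generator)
print(attempts, passed)  # 2 True
```

The first draft fails all three tests, the failure messages go back to the generator, and the second draft passes, so the loop stops well short of its cap of five.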

Solution

Therefore:

The generator produces a candidate. The evaluator scores it against the criteria and returns feedback. The generator revises the candidate using that feedback. The loop repeats until the evaluator passes the candidate or the maximum number of iterations is reached.

What this pattern forbids. Generator outputs are accepted only after the evaluator passes; an unbounded loop is forbidden by the iteration cap.
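
The loop and its two constraints can be sketched as follows. This is an illustration, not a reference implementation: `Verdict`, `LoopResult`, `evaluator_optimizer`, and the two callables are names assumed here, and the toy generator and evaluator stand in for real model calls. The sketch treats `max_iterations` as the number of evaluations and marks a candidate as accepted only when the evaluator passed it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool
    feedback: str  # specific critique handed back to the generator


@dataclass
class LoopResult:
    candidate: str
    accepted: bool   # True only if the evaluator passed the candidate
    iterations: int  # number of evaluations performed


def evaluator_optimizer(
    task: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    evaluate: Callable[[str, str], Verdict],
    max_iterations: int = 5,
) -> LoopResult:
    """Run generate -> evaluate -> revise until a pass or the cap."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    candidate = generate(task, None, None)  # first draft, no feedback yet
    for i in range(1, max_iterations + 1):
        verdict = evaluate(task, candidate)
        if verdict.passed:
            return LoopResult(candidate, accepted=True, iterations=i)
        if i < max_iterations:
            # Revise against the evaluator's feedback, then re-evaluate.
            candidate = generate(task, candidate, verdict.feedback)
    # Cap reached: return the last evaluated candidate, flagged as not accepted.
    return LoopResult(candidate, accepted=False, iterations=max_iterations)


# Toy stand-ins for the two model roles (assumptions for illustration).
def toy_generate(task, previous, feedback):
    return "draft" if previous is None else previous + "!"


def toy_evaluate(task, candidate):
    ok = candidate.endswith("!!")
    return Verdict(ok, "" if ok else "append another '!'")


result = evaluator_optimizer("demo", toy_generate, toy_evaluate)
print(result)
# LoopResult(candidate='draft!!', accepted=True, iterations=3)
```

Returning an explicit `accepted` flag leaves the caller to decide what to do when the cap is hit without a pass, for example escalating or discarding the output.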

The smaller patterns that complete this one —

  • generalises Reflection ★★: Have the model review its own output and produce a revised version in one or more passes.
  • uses LLM-as-Judge ★★: Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
  • generalises Planner-Generator-Evaluator Harness ·: Decompose a long-running job into three role-isolated agents: a Planner emitting a feature list, a Generator working one chunk per fresh context, and an Evaluator grading against a rubric without seeing the Generator's trace.

And the patterns that stand alongside it, or against it —

  • alternative-to Best-of-N Sampling: Sample N candidate outputs and select the highest-ranked by a reward model or scorer.
  • composes-with Planner-Executor-Observer: Add an explicit Observer role between Planner and Executor so progress is checked against the plan instead of trusted blindly.
  • conflicts-with Same-Model Self-Critique: Anti-pattern: have the same model both produce an answer and critique it, expecting independence.
  • alternative-to Self-Refine ★★: Iterate generate → feedback (same model) → refine until a stop criterion fires, with no separate critic model.
  • complements Voting-Based Cooperation: Finalise a decision across multiple agents by collecting and tallying their votes on candidate options, so the joint output reflects collective rather than single-agent judgement.
  • alternative-to Policy-Localizer-Validator: Split a GUI agent into three specialist models (a Policy that plans, a Localizer that grounds elements to pixels, and a Validator that judges completion) so each role uses the smallest sufficient model.
  • complements Blind Grader with Isolated Context: Run an evaluator in a separately-allocated context window with access only to the artifact and the rubric, never the producing agent's reasoning trace, so the grader cannot be primed by the producer's framing.
  • complements Darwin-Gödel Self-Rewrite ·: An agent rewrites its own source code, archives every successful variant, and samples mutation parents from the archive rather than the latest version, using archive diversity as stepping-stones to escape local optima.
  • alternative-to Scorer Live Monitoring: Score agent outputs asynchronously in production with non-blocking scorers that observe, alert, and log but do not regenerate the output.
  • complements Human Reflection: Reflection loop that explicitly collects human feedback (not approval) on agent plans to improve them, distinct from approval gates where the human only says yes/no.
  • alternative-to Planner-Executor-Verifier (PEV): Triadic specialization where a planner produces the plan, an executor runs it, and a separate verifier checks each step's effects against the original goal.
  • complements Compound Error Degradation: Anti-pattern: deploy a long-horizon agent without modelling that per-step accuracy multiplies across the trajectory.
  • complements Bayesian Bandit Experimentation: Replace fixed-split A/B tests between agent variants with a bandit that dynamically reallocates traffic toward better-performing variants based on observed reward, bounding regret from bad variants.
