Book VII

Verification & Reflection

Catching the model's mistakes.

29 patterns in this book. · Updated 2026-06-14

Top 5 patterns in Verification & Reflection by usage



agentpatternscatalog.org

Evaluator-Optimizer
a.k.a. Generator-Critic Loop · LLM-as-Judge Refinement
One LLM generates; another evaluates and feeds back; loop until criteria are met.
×6 compositions
Self-Refine
a.k.a. Iterative Self-Feedback
Iterate generate → feedback (same model) → refine until a stop criterion fires, with no separate critic model.
×6 compositions
Best-of-N Sampling
a.k.a. BoN · Reranking
Sample N candidate outputs and select the highest-ranked by a reward model or scorer.
×5 compositions
Reflection
a.k.a. Self-Critique · Single-Pass Self-Review
Have the model review its own output and produce a revised version in one or more passes.
×3 compositions
Confidence Reporting
a.k.a. Uncertainty Surfacing · Calibrated Output
Surface the agent's uncertainty about its answer alongside the answer itself.
×3 compositions

When to reach for each

01. Evaluator-Optimizer
One LLM generates; another evaluates and feeds back; loop until criteria are met.
Best for: Tasks where single-shot generation tops out below the quality the task requires.
Tradeoff: Cost = (generator + evaluator) × iterations.
Watch for: Cases where single-shot generation already meets quality targets, so the loop only adds cost.
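The Evaluator-Optimizer loop above can be sketched in a few lines. This is a minimal illustration, not the catalog's reference implementation: `evaluator_optimizer`, `toy_generate`, and `toy_evaluate` are hypothetical names, and the two toy functions stand in for real model calls.

```python
from typing import Callable, Optional, Tuple


def evaluator_optimizer(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str, str], Tuple[bool, str]],
    max_iterations: int = 3,
) -> Tuple[str, int]:
    """Run generate -> evaluate rounds until the evaluator accepts
    or the iteration budget is spent. Returns (draft, rounds_used)."""
    feedback: Optional[str] = None
    draft = ""
    for round_no in range(1, max_iterations + 1):
        draft = generate(task, feedback)            # generator call
        accepted, feedback = evaluate(task, draft)  # evaluator call
        if accepted:
            return draft, round_no
    return draft, max_iterations


# Hypothetical stand-ins for two separate LLM calls.
def toy_generate(task: str, feedback: Optional[str]) -> str:
    return "hello world" if feedback is None else "Hello world."


def toy_evaluate(task: str, draft: str) -> Tuple[bool, str]:
    if draft[0].isupper() and draft.endswith("."):
        return True, ""
    return False, "Capitalize the first letter and end with a period."


result, rounds = evaluator_optimizer("Write a greeting.", toy_generate, toy_evaluate)
print(result, rounds)
```

Each round makes one generator call and one evaluator call, which is where the (generator + evaluator) × iterations cost comes from; `max_iterations` caps that cost.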

02. Self-Refine
Iterate generate → feedback (same model) → refine until a stop criterion fires, with no separate critic model.
Best for: Cases where the same model can produce useful self-feedback against an explicit improvement target.
Tradeoff: Reinforces same-model blind spots (Reflexion replication studies).
Watch for: A different model family being available that would give independent critique.
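A rough sketch of the Self-Refine loop, assuming a single callable plays all three roles (generate, give feedback, refine). The prompt prefixes, the `STOP_TOKEN` string, and `toy_model` are illustrative assumptions, not part of the pattern's definition.

```python
from typing import Callable

STOP_TOKEN = "NO FURTHER CHANGES"  # hypothetical stop signal emitted by the model


def self_refine(task: str, model: Callable[[str], str], max_rounds: int = 3) -> str:
    """Generate, then alternate feedback and refinement with the SAME model
    until it signals a stop or the round budget runs out."""
    draft = model(f"TASK: {task}")
    for _ in range(max_rounds):
        feedback = model(f"CRITIQUE: {draft}")
        if STOP_TOKEN in feedback:  # stop criterion fires
            break
        draft = model(f"REFINE: {draft}\nNOTES: {feedback}")
    return draft


# Hypothetical stand-in for one LLM that handles all three prompt types.
def toy_model(prompt: str) -> str:
    if prompt.startswith("TASK:"):
        return "the cat sat"
    if prompt.startswith("CRITIQUE:"):
        return STOP_TOKEN if "on the mat" in prompt else "Say where the cat sat."
    return "the cat sat on the mat"


final = self_refine("Describe the cat.", toy_model)
print(final)
```

Because the critic and the generator are the same callable, any error the model cannot see in its own draft passes through every round unchanged.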

03. Best-of-N Sampling
Sample N candidate outputs and select the highest-ranked by a reward model or scorer.
Best for: Cases where a scorer or reward model exists that ranks candidates better than the generator picks them.
Tradeoff: Cost scales with N.
Watch for: No reliable scorer being available to pick among candidates.
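Best-of-N reduces to "sample N times, keep the top-scoring candidate". The sketch below assumes hypothetical `sample` and `score` callables; the canned candidates and the length-based scorer are placeholders for a real generator and reward model.

```python
from typing import Callable, List


def best_of_n(
    prompt: str,
    sample: Callable[[str], str],
    score: Callable[[str], float],
    n: int = 4,
) -> str:
    """Draw n candidates and return the one the scorer ranks highest.
    Ties go to the earliest candidate (behavior of max())."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    return max(candidates, key=score)


# Hypothetical stand-ins: canned samples and a toy scorer that prefers longer text.
_canned = iter(["ok", "fine answer", "a thorough answer", "meh"])


def toy_sample(prompt: str) -> str:
    return next(_canned)


def toy_score(candidate: str) -> float:
    return float(len(candidate))


best = best_of_n("Explain X.", toy_sample, toy_score, n=4)
print(best)
```

The generator runs n times per request, so cost grows linearly with n, and the result is only as good as `score`.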

04. Reflection
Have the model review its own output and produce a revised version in one or more passes.
Best for: Cases where one-shot generation underuses the model and a critique pass would catch errors.
Tradeoff: Diminishing returns after one or two passes.
Watch for: The model already producing correct outputs in one pass.
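One possible reading of Reflection as code: answer once, then run a fixed number of review-and-revise passes with the same model. The prompt layout and `toy_reviewer_model` are assumptions made for illustration.

```python
from typing import Callable


def reflect(task: str, model: Callable[[str], str], passes: int = 1) -> str:
    """Produce an answer, then have the same model review and revise it
    for a fixed number of passes."""
    answer = model(f"ANSWER: {task}")
    for _ in range(passes):
        answer = model(f"REVIEW AND REVISE\nTask: {task}\nDraft: {answer}")
    return answer


# Hypothetical stand-in: the first answer has an arithmetic slip,
# and the review pass fixes it; already-correct drafts come back unchanged.
def toy_reviewer_model(prompt: str) -> str:
    if prompt.startswith("ANSWER:"):
        return "2 + 2 * 3 = 12"
    draft = prompt.split("Draft: ", 1)[1]
    return "2 + 2 * 3 = 8" if draft.endswith("= 12") else draft


one_pass = reflect("Compute 2 + 2 * 3", toy_reviewer_model, passes=1)
two_pass = reflect("Compute 2 + 2 * 3", toy_reviewer_model, passes=2)
print(one_pass, "|", two_pass)
```

In this toy run the second pass changes nothing, which mirrors the diminishing-returns tradeoff noted above.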

05. Confidence Reporting
Surface the agent's uncertainty about its answer alongside the answer itself.
Best for: Cases where downstream code or UI needs to distinguish 'I know' from 'I am guessing' on each answer.
Tradeoff: Calibration is empirical and drifts.
Watch for: Confidence labels being ignored by both the UI and the routing layer.
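To illustrate how a routing layer might consume a reported confidence, here is a small sketch. `ReportedAnswer`, `route`, the route names, and the 0.7 threshold are all hypothetical; a real threshold would need to be set and periodically re-checked against labeled outcomes, since calibration drifts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedAnswer:
    text: str
    confidence: float  # in [0, 1]; not calibrated unless validated on labeled data


def route(answer: ReportedAnswer, threshold: float = 0.7) -> str:
    """Send confident answers straight through and the rest to review.
    The threshold value is an assumption, not a recommended default."""
    if not 0.0 <= answer.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return "auto_accept" if answer.confidence >= threshold else "human_review"


confident = route(ReportedAnswer("Paris", 0.95))
guessing = route(ReportedAnswer("Possibly Lyon", 0.40))
print(confident, guessing)
```

If nothing downstream branches on `confidence` the way `route` does, the label carries no weight and the pattern adds little.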

All patterns in this book

Evaluator-Optimizer

Self-Refine

Best-of-N Sampling

Reflection

Confidence Reporting

Process Reward Model

Self-Modification Diff Gate

Prompt Variant Evaluation

Dimensional Synthetic Eval Set

Frozen Rubric Reflection

Tool-Augmented Self-Correction

Reflexion

Self-Consistency

Behavior-Pinning Test Before Agent Edit

Blind Grader with Isolated Context

Cross-Reflection

Deterministic-LLM Sandwich

Generator-Critic Separation

Human Reflection

Red-Team Sandbox Reproduction

Stochastic-Deterministic Boundary (SDB)

Verify-Before-Cite Resolution Gate

Agentic Context Engineering Playbook

Commitment Tracking

Darwin-Gödel Self-Rewrite

Echo Recognition

World Model as Tool

Confidence-Checking Workflow

Planner-Executor-Verifier (PEV)