VII · Verification & Reflection · Mature ★★

Self-Consistency

also known as Sample-and-Vote, Empirical Introspection, Marginalised Reasoning

Sample the same question multiple times at non-zero temperature and aggregate the answers by majority vote or a judge to mitigate hallucination.

This pattern helps complete certain larger patterns —

  • specialises Parallelization (★★): Run independent LLM calls concurrently and combine results.
  • used-by Confidence Reporting: Surface the agent's uncertainty about its answer alongside the answer itself.
  • specialises Test-Time Compute Scaling (★★): Allocate more inference-time compute (samples, search, deeper thinking) instead of scaling parameters to improve quality.
  • specialises Voting-Based Cooperation: Finalise a decision across multiple agents by collecting and tallying their votes on candidate options, so the joint output reflects collective rather than single-agent judgement.

Context

A team uses a large language model on reasoning-heavy tasks such as math word problems, multi-step logic puzzles, or multiple-choice questions. The model is mostly right, but it occasionally follows a flawed chain of intermediate steps and confidently produces the wrong answer. The team can extract a comparable answer (a number, a class, a final choice) from each generation, and the inference budget permits running the same prompt several times in parallel.

Problem

A single sample at zero temperature yields one greedily decoded chain of reasoning. That chain is sometimes wrong, and downstream code cannot tell when. Re-sampling at non-zero temperature can produce a different answer, but the team has no principled way to decide which sample to trust: without a rule for combining samples, it either accepts whatever the first call returned or picks among samples arbitrarily. The team is also missing a free signal. The spread across samples indicates how much confidence the answer deserves, but a one-shot pipeline never sees it.

Forces

  • N samples cost roughly N times as much as one.
  • Aggregation logic depends on whether the answer is a class, a number, or free text.
  • Variance is itself signal: a question whose samples disagree is one the model is uncertain about.
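The last force can be made concrete with a small measurement of how much the samples disagree. The sketch below is illustrative only: the function name `vote_spread` and the choice of two metrics (share of the top answer, and entropy normalised by the sample count) are assumptions, not something the pattern prescribes.

```python
import math
from collections import Counter


def vote_spread(answers):
    """Return (agreement, normalised_entropy) for a list of extracted answers.

    agreement: share of samples matching the most common answer (1.0 = unanimous).
    normalised_entropy: Shannon entropy of the vote distribution divided by
    log(len(answers)); 0.0 means unanimous, 1.0 means every sample differs.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    n = len(answers)
    agreement = counts.most_common(1)[0][1] / n
    if n == 1:
        return agreement, 0.0
    # Written as p * log(1/p) so a unanimous vote yields exactly 0.0.
    entropy = sum((c / n) * math.log(n / c) for c in counts.values())
    return agreement, entropy / math.log(n)
```

A caller might treat low agreement or high entropy as a cue to flag the answer as unsure; any threshold for doing so would need to be tuned on the team's own data.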

Example

A math-tutoring agent running at zero temperature gets about one problem in ten wrong, and it states each wrong answer as confidently as the right ones. The team samples each problem five times at temperature 0.7, extracts the numeric answer from each sample, and takes a majority vote. The answer most chains converge on is usually the right one, and disagreement across samples becomes a useful 'unsure' signal. Per-problem cost is five times higher, but accuracy on the long tail of tricky problems climbs noticeably.
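A rough back-of-the-envelope calculation shows why voting can help in a setup like this one. It rests on two simplifying assumptions that are not claims from the example: each sample is correct independently of the others, and each is correct with the same probability `p`. Real samples from one model are correlated, so the figure is an optimistic illustration rather than a prediction.

```python
from math import comb


def majority_correct_probability(p, n):
    """Probability that more than half of n samples are correct.

    Assumes (for illustration only) that samples are independent and that
    each one is correct with the same probability p.
    """
    needed = n // 2 + 1  # strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))


# Illustrative value: p = 0.9 is an assumed per-sample accuracy.
print(round(majority_correct_probability(0.9, 5), 4))
```

Under these assumptions the script prints 0.9914, meaning five votes would turn a 10% error rate into one below 1%. The gain shrinks quickly when errors are correlated, which is why measured accuracy on the team's own problems is the figure to trust.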

Solution

Therefore:

Run the same prompt N times at non-zero temperature and extract the answer from each sample. Aggregate according to answer type: majority vote for discrete answers, median for numeric answers, a judge for free-form text. Log the variance across samples as a confidence signal.
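The steps above can be sketched in a few lines of Python. Everything named here is hypothetical: `sample_fn` stands in for one model call at non-zero temperature, `extract_fn` for whatever parsing yields a comparable answer, and `judge_fn` for a judge that picks among free-form candidates. The choice of spread metric per answer type is likewise an assumption.

```python
import logging
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from statistics import median, pstdev

log = logging.getLogger("self_consistency")


def self_consistency(prompt, sample_fn, extract_fn, n=5, kind="discrete", judge_fn=None):
    """Sample `prompt` n times, aggregate the answers, return (final, signal).

    sample_fn(prompt) -> str             hypothetical model call, temperature > 0
    extract_fn(text) -> answer or None   hypothetical answer extractor
    judge_fn(prompt, candidates) -> answer   hypothetical judge for free-form text
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # The n calls are independent, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=n) as pool:
        generations = list(pool.map(sample_fn, [prompt] * n))

    # Drop samples from which no answer could be extracted.
    answers = [a for a in (extract_fn(g) for g in generations) if a is not None]
    if not answers:
        raise ValueError("no sample produced an extractable answer")

    if kind == "discrete":
        # Ties go to the answer that was seen first.
        final, votes = Counter(answers).most_common(1)[0]
        signal = {"agreement": votes / len(answers)}
    elif kind == "numeric":
        final = median(answers)
        signal = {"stdev": pstdev(answers)}
    elif kind == "free_form":
        if judge_fn is None:
            raise ValueError("free_form aggregation needs a judge_fn")
        final = judge_fn(prompt, answers)
        signal = {"distinct": len(set(answers))}
    else:
        raise ValueError(f"unknown kind: {kind!r}")

    # The spread is logged as a confidence signal alongside the aggregate.
    log.info("self-consistency n=%d kind=%s signal=%s", len(answers), kind, signal)
    return final, signal


def extract_last_number(text):
    """Hypothetical extractor: the last number in the text as a string, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return matches[-1] if matches else None
```

Exact numeric results, as in the math-tutoring example, can be voted on as discrete strings with `kind="discrete"`; `kind="numeric"` suits estimates where nearby values should pull the result toward the middle. Only the aggregate is returned, in keeping with the rule that individual samples have no authority.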

What this pattern forbids. The final answer is the aggregate, not any single sample; individual samples have no authority.

And the patterns that stand alongside it, or against it —

  • alternative-to Best-of-N Sampling: Sample N candidate outputs and select the highest-ranked by a reward model or scorer.
  • complements Debate (·): Have multiple agents argue different positions on a question and converge through structured exchange.
  • complements Language Agent Tree Search (·): Lift the agent loop into a search tree with a learned value function and backtracking.
  • alternative-to MapReduce for Agents: Split an oversize task into independent chunks, process each in parallel, then aggregate.
  • complements Chain of Thought (★★): Elicit multi-step reasoning by prompting the model to produce intermediate steps before its final answer.
  • complements Chain of Verification: Reduce hallucination by drafting an answer, generating independent verification questions, answering them in isolation, and revising.
  • complements STaR Bootstrapping: Bootstrap a model's reasoning by training it on its own correct chain-of-thought outputs.
  • complements Adaptive Branching Tree Search (·): At each node of an inference-time search tree, use Thompson sampling to decide whether to deepen an existing answer or branch a fresh attempt, optionally choosing per-node which underlying LLM to invoke.
