Self-Consistency

also known as Sample-and-Vote, Empirical Introspection, Marginalised Reasoning

Sample the same question multiple times at non-zero temperature and aggregate the answers by majority vote or a judge model to mitigate hallucination.

This pattern helps complete certain larger patterns —

specialises Parallelization ★★: Run independent LLM calls concurrently and combine results.
used-by Confidence Reporting ★: Surface the agent's uncertainty about its answer alongside the answer itself.
specialises Test-Time Compute Scaling ★★: Allocate more inference-time compute (samples, search, deeper thinking) instead of scaling parameters to improve quality.
specialises Voting-Based Cooperation ★: Finalise a decision across multiple agents by collecting and tallying their votes on candidate options, so the joint output reflects collective rather than single-agent judgement.

Context

A team uses a large language model on reasoning-heavy tasks like math word problems, multi-step logic puzzles, or multiple-choice questions where the model is mostly right but occasionally invents a wrong intermediate chain and confidently produces the wrong answer. The team can extract a comparable answer (a number, a class, a final choice) from each generation. Inference cost permits running the same prompt several times in parallel.

Problem

A single sample at zero temperature gives the model's single most likely chain of reasoning, but that chain is sometimes the wrong one and there is no way for downstream code to tell. Trying again with a different seed can produce a different answer, and the team has no principled way to decide which sample to trust. Without a way to combine multiple samples, the team either accepts whatever the first call returned or picks among samples arbitrarily. They are also missing a free signal: the spread across samples is itself informative about how confident the model should be, but a one-shot pipeline never gets to see it.

Forces

N samples cost N times as much as one.
Aggregation logic depends on whether the answer is a class, a number, or free text.
Variance is itself a signal: a high-variance question is one the model is uncertain about.
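
The last force can be made concrete. The following is a minimal sketch, assuming the answers are discrete and have already been extracted as strings; the helper names `agreement` and `vote_entropy` are illustrative and do not come from any library.

```python
import math
from collections import Counter


def agreement(answers):
    """Share of samples that match the most common answer (1.0 = unanimous)."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def vote_entropy(answers):
    """Shannon entropy of the vote distribution in bits (0.0 = unanimous)."""
    counts = Counter(answers)
    total = len(answers)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical extracted answers for two different questions.
confident = ["42", "42", "42", "42", "42"]
uncertain = ["42", "17", "42", "36", "17"]

print(agreement(confident), vote_entropy(confident))
print(agreement(uncertain), round(vote_entropy(uncertain), 3))
```

Either number could serve as the uncertainty signal. Agreement is easier to threshold, while entropy also distinguishes a two-way split from answers scattered across many values.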

Example

A math-tutoring agent at zero temperature gets roughly one problem in ten wrong, and it states each wrong answer as confidently as the right ones. The team samples each problem five times at temperature 0.7, extracts the numeric answer from each, and majority-votes. The answer most chains converge on is usually the right one, and disagreement across samples becomes a useful 'unsure' signal. Per-problem cost is five times higher, but accuracy on the long tail of tricky problems climbs noticeably.
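
The extract-and-vote step of this example might look like the sketch below. The five sample texts are invented for illustration, and the code assumes the prompt instructs the model to end with the phrase "The answer is X."; that convention is an assumption, not part of the pattern.

```python
import re
from collections import Counter

# Hypothetical model outputs for one problem, sampled five times.
samples = [
    "12 boxes of 8 is 96, minus 6 broken ones. The answer is 90.",
    "8 * 12 = 96 and 96 - 6 = 90. The answer is 90.",
    "12 + 8 = 20, then 20 - 6 = 14. The answer is 14.",
    "96 pencils in total, 6 break, so 90 remain. The answer is 90.",
    "8 * 12 = 98, and 98 - 6 = 92. The answer is 92.",
]

# Assumed output convention: the chain ends with "The answer is <number>."
ANSWER_RE = re.compile(r"The answer is (-?\d+(?:\.\d+)?)")


def extract_answer(text):
    """Return the final numeric answer as a string, or None if absent."""
    match = ANSWER_RE.search(text)
    return match.group(1) if match else None


# Drop samples with no parseable answer, then take the most common one.
answers = [a for a in map(extract_answer, samples) if a is not None]
winner, votes = Counter(answers).most_common(1)[0]
agreement = votes / len(answers)

print(winner, votes, agreement)
```

Two of the five chains go wrong in different ways, so their answers do not reinforce each other, while the three sound chains agree. A low agreement value is the cue to flag the problem as 'unsure' rather than present the winner with full confidence.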

Diagram

flowchart TD
    Q[Same prompt] --> S1[Sample 1<br/>temp > 0]
    Q --> S2[Sample 2]
    Q --> S3[Sample N]
    S1 --> Ex[Extract answer]
    S2 --> Ex
    S3 --> Ex
    Ex --> Agg{Aggregate}
    Agg -->|discrete| Vote[Majority vote]
    Agg -->|numeric| Med[Median]
    Agg -->|free-form| Judge[Judge model]
    Vote --> Out[Final answer + variance signal]
    Med --> Out
    Judge --> Out

Solution

Therefore:

Run the same prompt N times with non-zero temperature. Extract the answer from each. Aggregate: majority vote for discrete answers, median for numeric answers, a judge model for free-form text. Log the variance across samples as a confidence signal.
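
The whole loop can be sketched end to end. This is one possible shape under stated assumptions, not a reference implementation: `self_consistency`, `fake_sampler`, the `kind` parameter, and the `judge` callable are all hypothetical names, and `sample_fn` is assumed to wrap both the LLM call at temperature above zero and the answer extraction (returning numbers when `kind` is "numeric").

```python
import random
import statistics
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def self_consistency(sample_fn, prompt, n=5, kind="discrete", judge=None):
    """Sample n answers concurrently, aggregate them, and report the spread."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(lambda _: sample_fn(prompt), range(n)))

    if kind == "discrete":
        final, votes = Counter(answers).most_common(1)[0]
        spread = 1 - votes / n          # share of samples that disagree
    elif kind == "numeric":
        final = statistics.median(answers)
        spread = statistics.pstdev(answers)
    elif kind == "free-form":
        if judge is None:
            raise ValueError("free-form answers need a judge callable")
        final = judge(prompt, answers)  # hypothetical judge-model call
        spread = None                   # no cheap spread measure here
    else:
        raise ValueError(f"unknown kind: {kind}")

    return {"answer": final, "spread": spread, "samples": answers}


def fake_sampler(prompt):
    """Stand-in for an LLM call at temperature > 0 plus answer extraction."""
    return random.choice(["B", "B", "B", "C"])


result = self_consistency(fake_sampler, "Which option is correct?", n=7)
print(result["answer"], result["spread"])
```

Only the aggregate is returned as the answer; the individual samples are kept for logging. In a majority-vote tie, `Counter.most_common` returns the tied answer that was seen first, so a production version would probably want an explicit tie-break rule or an odd N.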

What this pattern forbids. The final answer is the aggregate, not any single sample; individual samples have no authority.

And the patterns that stand alongside it, or against it —

alternative-to Best-of-N Sampling ★: Sample N candidate outputs and select the highest-ranked by a reward model or scorer.
complements Debate ·: Have multiple agents argue different positions on a question and converge through structured exchange.
complements Language Agent Tree Search ·: Lift the agent loop into a search tree with a learned value function and backtracking.
alternative-to MapReduce for Agents ★: Split an oversize task into independent chunks, process each in parallel, then aggregate.
complements Chain of Thought ★★: Elicit multi-step reasoning by prompting the model to produce intermediate steps before its final answer.
complements Chain of Verification ★: Reduce hallucination by drafting an answer, generating independent verification questions, answering them in isolation, and revising.
complements STaR Bootstrapping ★: Bootstrap a model's reasoning by training it on its own correct chain-of-thought outputs.
complements Adaptive Branching Tree Search ·: At each node of an inference-time search tree, use Thompson sampling to decide whether to deepen an existing answer or branch a fresh attempt, optionally choosing per-node which underlying LLM to invoke.
alternative-to Trajectory-Summary Test-Time Scaling ·: When an agent's outputs are extended action-observation trajectories rather than short answers, scale test-time compute by compressing each rollout into a structured summary and selecting or reusing across those summaries instead of raw traces.
alternative-to Confident Inconsistency ✕: Anti-pattern. In a regulated workflow the same query produces materially different outputs at different times, each looking correct and passing review, so the variance stays invisible unless outputs are deliberately re-run and compared across time.

Used in frameworks

DSPy
first-class · 16 patterns · Orchestration Frameworks ★★ · mature
dspy.majority (usable as the reduce_fn of an Ensemble) takes multiple completions of a module and returns the most frequently occurring one, the classic self-consistency majority vote.

References

Self-Consistency Improves Chain of Thought Reasoning in Language Models
paper

Provenance

Source: patterns/self-consistency.md on GitHub · commit 4fa1213
Added to catalog: 2026-04-30
Last updated: 2026-05-21
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.