Verification & Reflection

Self-Consistency

Sample the same question multiple times at non-zero temperature and aggregate by majority or judge to mitigate hallucination.

Problem

A single sample at zero temperature gives the model's single most likely chain of reasoning, but that chain is sometimes the wrong one and there is no way for downstream code to tell. Trying again with a different seed can produce a different answer, and the team has no principled way to decide which sample to trust. Without a way to combine multiple samples, the team either accepts whatever the first call returned or picks among samples arbitrarily. They are also missing a free signal: the spread across samples is itself informative about how confident the model should be, but a one-shot pipeline never gets to see it.

Solution

Run the same prompt N times with non-zero temperature. Extract the answer from each. Aggregate: majority vote for discrete answers, median for numeric, judge for free-form. Variance across samples is logged as a confidence signal.

When to use

Reasoning-heavy questions where the model is mostly right but sometimes invents a wrong chain.
Answers are extractable in a comparable form (discrete, numeric, or judgeable).
Cost of N samples is acceptable relative to the quality lift.

Open the full interactive page →

Diagram, neighbourhood map, code examples, related patterns and full provenance.

Problem

Solution

When to use

Related