Verification & Reflection

Best-of-N Sampling

Sample N candidate outputs and select the one that a reward model or scorer ranks highest.

Problem

A single sample drawn from the model at low temperature is often acceptable but rarely the best the model can produce, and on any given prompt the team has no way to tell whether they got a good draw or a mediocre one. Increasing temperature on a single sample raises variance without raising the floor: sometimes the result is better and sometimes worse, and the team ships whichever one happens to come out. Without a selection step that compares several candidates, the model's own decoding choice is the only filter on quality.

Solution

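Generate N candidates at a non-zero temperature so that they differ from one another. Score each candidate with a reward model or a rule-based scorer. Return the highest-scoring candidate (top-1) or the K highest (top-K). BoNBoN alignment fine-tunes a model to mimic the best-of-N (BoN) distribution directly, removing the need to draw N samples per request at inference time.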
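The three steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `best_of_n`, `toy_generate`, and `toy_score` are hypothetical names, the generator stands in for a model call made at non-zero temperature, and the scorer stands in for a reward model or rule-based check.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # assumed: one stochastic model call
    score: Callable[[str, str], float],  # assumed: reward model or rule-based scorer
    n: int = 8,
    k: int = 1,
) -> List[Tuple[float, str]]:
    """Draw n candidates, score each, and return the k best as (score, text)."""
    if n < 1 or not 1 <= k <= n:
        raise ValueError("need n >= 1 and 1 <= k <= n")
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    # Highest score first; ties keep generation order because sort is stable.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy stand-ins so the sketch runs without a model (purely illustrative).
_rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    # Pretend each call is a sample drawn at non-zero temperature.
    return prompt + " " + " ".join(_rng.choice(["ok", "good", "great"]) for _ in range(3))


def toy_score(prompt: str, candidate: str) -> float:
    # Rule-based scorer: count occurrences of the word "great".
    return float(candidate.split().count("great"))


if __name__ == "__main__":
    for s, text in best_of_n("Review:", toy_generate, toy_score, n=8, k=2):
        print(s, text)
```

Passing k greater than 1 returns the top-K list, which is useful when a downstream step (a human reviewer or a second scorer) makes the final choice.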

When to use

A scorer or reward model exists that ranks candidates better than the generator picks them.
Quality lift from selecting the best of N samples justifies the N-fold inference cost.
Sampling temperature can be raised enough to produce meaningfully diverse candidates.
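One rough way to weigh the second condition is to score a pilot batch of single samples and estimate how the expected best score grows with N. The sketch below assumes candidates are independent draws and treats the pilot scores as the full score distribution; `expected_best_of_n` and the pilot values are hypothetical. The estimate is expressed in the scorer's own units, so it is only as trustworthy as the scorer.

```python
from typing import Sequence


def expected_best_of_n(pilot_scores: Sequence[float], n: int) -> float:
    """Expected maximum of n draws (with replacement) from the pilot scores.

    With the m pilot scores sorted ascending, the chance that the maximum of
    n draws is the i-th smallest is (i/m)**n - ((i-1)/m)**n.
    """
    if not pilot_scores or n < 1:
        raise ValueError("need at least one pilot score and n >= 1")
    s = sorted(pilot_scores)
    m = len(s)
    return sum(
        x * ((i / m) ** n - ((i - 1) / m) ** n)
        for i, x in enumerate(s, start=1)
    )


if __name__ == "__main__":
    pilot = [0.2, 0.4, 0.5, 0.5, 0.7, 0.9]  # made-up scores for illustration
    for n in (1, 2, 4, 8, 16):
        print(n, round(expected_best_of_n(pilot, n), 3))
```

If the estimate flattens after a small N, larger N adds cost with little lift. If the pilot scores barely vary at all, the sampling temperature may be too low to produce meaningfully different candidates.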
