Best-of-N Sampling
also known as BoN, Reranking, BoNBoN Alignment
Sample N candidate outputs and select the highest-ranked by a reward model or scorer.
This pattern helps complete certain larger patterns —
- specialises Parallelization ★★: Run independent LLM calls concurrently and combine results.
- specialises Test-Time Compute Scaling ★★: Allocate more inference-time compute (samples, search, deeper thinking) instead of scaling parameters to improve quality.
- used-by Process Reward Model ★: Train a verifier that scores each reasoning step rather than only the final answer.
- used-by ReST-EM ★: Iterate generate → reward-filter → fine-tune to bootstrap reasoning capabilities without human-labelled data.
- specialises Adaptive Branching Tree Search ·: At each node of an inference-time search tree, use Thompson sampling to decide whether to deepen an existing answer or branch a fresh attempt, optionally choosing per-node which underlying LLM to invoke.
Context
A team runs a large language model on a task where the quality of any single output varies noticeably from sample to sample, such as a code-review summary, a translation, or a customer reply. They have a way to rank candidate outputs against each other, either a trained reward model that scores responses or a rule-based scorer that approximates one. Inference cost is high enough to matter but not so high that running the model a few extra times for the same prompt is prohibitive.
Problem
A single sample drawn from the model at low temperature is often acceptable but rarely the best the model can produce, and on any given prompt the team has no way to tell whether they got a good draw or a mediocre one. Increasing temperature on a single sample raises variance without raising the floor: sometimes the result is better and sometimes worse, and the team ships whichever one happens to come out. Without a selection step that compares several candidates, the model's own decoding choice is the only filter on quality.
Forces
- N candidates cost N inferences.
- Reward-model quality bounds achievable improvement.
- Diversity across candidates is needed; identical samples defeat the pattern.
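The second force can be illustrated with a toy simulation. This is only a sketch under loud assumptions: candidate quality is modelled as a standard normal draw, and the scorer sees that quality plus independent Gaussian noise. The function name `selected_quality` and its parameters are hypothetical and do not come from any library.

```python
import random
from statistics import mean


def selected_quality(n: int, scorer_noise: float,
                     trials: int = 2000, seed: int = 0) -> float:
    """Average TRUE quality of the candidate a noisy scorer picks from n draws.

    Assumptions (illustrative only): true quality ~ N(0, 1); the scorer
    observes true quality plus N(0, scorer_noise) noise.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same numbers
    picked = []
    for _ in range(trials):
        true_q = [rng.gauss(0.0, 1.0) for _ in range(n)]
        scores = [q + rng.gauss(0.0, scorer_noise) for q in true_q]
        # Select by the scorer's (noisy) view, then record the true quality.
        best = max(range(n), key=scores.__getitem__)
        picked.append(true_q[best])
    return mean(picked)


if __name__ == "__main__":
    for noise in (0.0, 1.0, 3.0):
        print(f"noise={noise}: N=1 -> {selected_quality(1, noise):.2f}, "
              f"N=8 -> {selected_quality(8, noise):.2f}")
```

Under these assumptions, the gain from N=8 over N=1 shrinks as `scorer_noise` grows: extra samples only help to the extent that the scorer's ranking tracks real quality.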
Example
A code-review assistant generates a one-paragraph summary for each pull request, and roughly one in five reads awkwardly. The team enables Best-of-N: for each PR, the model samples five candidate summaries with temperature 0.7, and a small reward model trained on past human-edited summaries picks the highest-rated one to display. Token cost goes up about five times for that step, but the rate of summaries that reviewers feel compelled to rewrite drops sharply.
Solution
Therefore:
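Generate N candidates for the same prompt at non-zero temperature, so that the candidates differ. Score each one with a reward model or rule-based scorer. Return the top-scoring candidate (or the top K). BoNBoN alignment goes a step further: it fine-tunes the model to mimic the best-of-N distribution directly, so that a single sample approximates a best-of-N pick without the N-fold sampling cost at inference time.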
What this pattern forbids. The chosen output must be one of the sampled candidates; the selection step does not synthesise a new answer across candidates.
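The steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `best_of_n`, `generate`, and `score` are hypothetical names, and the two callables stand in for whatever model client and reward model or rule-based scorer a team actually uses.

```python
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> text
    score: Callable[[str, str], float],     # hypothetical: (prompt, candidate) -> reward
    n: int = 5,
    temperature: float = 0.7,
    top_k: int = 1,
) -> List[Tuple[float, str]]:
    """Sample n candidates, score each, and return the top_k (score, text) pairs."""
    if n < 1 or not 1 <= top_k <= n:
        raise ValueError("need n >= 1 and 1 <= top_k <= n")
    if temperature <= 0:
        raise ValueError("temperature must be > 0 so candidates can differ")

    # N candidates cost N inferences; here they run one after another.
    candidates = [generate(prompt, temperature) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]

    # Sort by score only; the sort is stable, so ties keep generation order.
    scored.sort(key=lambda pair: pair[0], reverse=True)

    # Every returned text is an unmodified member of the candidate set.
    return scored[:top_k]
```

Calling `best_of_n(prompt, generate, score, n=5, temperature=0.7)` mirrors the settings in the Example. The generation calls are independent of one another, so a real deployment would likely issue them concurrently rather than in a loop.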
And the patterns that stand alongside it, or against it —
- alternative-to Self-Consistency ★★: Sample the same question multiple times at non-zero temperature and aggregate by majority or judge to mitigate hallucination.
- alternative-to Evaluator-Optimizer ★★: One LLM generates; another evaluates and feeds back; loop until criteria are met.
- complements Automatic Workflow Search ·: Treat the agent's workflow (a graph of LLM-invoking nodes) as an artefact to search; use Monte Carlo Tree Search guided by an eval benchmark to discover the best workflow, then deploy it.
- alternative-to Voting-Based Cooperation ★: Finalise a decision across multiple agents by collecting and tallying their votes on candidate options, so the joint output reflects collective rather than single-agent judgement.
- alternative-to Parallel-Voice Proposer ·: Generate several candidate thoughts in parallel under named voices and have the same model pick the canonical one, logging the losers as audit.
- complements Multi-Path Plan Generator ★★: Generate multiple candidate next-steps at each plan node, enabling later selection; the planning-generator pattern paired with tree-of-thoughts / LATS-style search.
- complements Generate-and-Test Strategy ★: Generate multiple candidate solutions in parallel, then systematically test each against declared constraints rather than committing to the first plausible one (adapted from Langley & Simon's cognitive-science research on human expert problem-solving).