I · Reasoning · Emerging

STaR Bootstrapping

also known as Self-Taught Reasoner, Rationale Bootstrapping

Bootstrap a model's reasoning by training it on its own correct chain-of-thought outputs.

This pattern helps complete certain larger patterns —

  • specialises ReST-EM: Iterate generate → reward-filter → fine-tune to bootstrap reasoning capabilities without human-labelled data.

Context

A team wants to fine-tune a model to become a better reasoner on a class of problems where chain-of-thought prompting visibly helps. They have ground-truth final answers for a training set, and they have compute to generate many model outputs. What they do not have is a dataset of human-written rationales — the step-by-step solutions a person would normally write between problem statement and final answer.

Problem

Without step-by-step explanations to train on, supervised fine-tuning for reasoning is stuck: the model can be trained to produce final answers, but not the rationales that lead to them. At the same time, chain-of-thought prompting of the base model has plateaued; prompting alone will not improve it further. The team needs a way to build a training set of rationales without humans writing them, and a training loop that avoids the instability of full reinforcement learning.

Forces

  • Filter quality determines which rationales count as 'correct' and get reinforced.
  • Flawed rationales that happen to reach the right answer can leak into the training set.
  • Each round repeats generation and filtering over the whole training set, which costs compute.
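
The first two forces can be made concrete with a small sketch. It assumes the filter is a plain comparison of normalized final answers; the function names and the toy samples are hypothetical, chosen only for illustration. Because the rationale text is never inspected, a flawed rationale that lands on the right answer passes alongside a sound one.

```python
def normalize_answer(text: str) -> str:
    """Trim, lowercase, drop a trailing period and thousands separators.

    Assumption: answers are short strings, mostly numeric. Removing commas
    would be wrong for free-text answers.
    """
    cleaned = text.strip().lower().rstrip(".")
    return cleaned.replace(",", "")


def passes_filter(predicted: str, gold: str) -> bool:
    """Answer-only filter: compares final answers, never reads the rationale."""
    return normalize_answer(predicted) == normalize_answer(gold)


# Toy samples for the question "3 boxes of 4 apples: how many apples?"
sound = {"rationale": "3 boxes of 4 is 3 * 4 = 12.", "answer": "12"}
flawed = {"rationale": "3 + 4 = 7, and 7 + 5 = 12.", "answer": "12."}
wrong = {"rationale": "3 + 4 = 7.", "answer": "7"}

# The flawed rationale survives because only the final answer is checked.
kept = [s for s in (sound, flawed, wrong) if passes_filter(s["answer"], "12")]
```

A stricter filter (for example a step checker or a judge model) would narrow this leak, at the price of more compute per sample.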

Example

A team has a small base model that knows facts but cannot reliably reason. They prompt it with CoT to generate (rationale, answer) pairs across a dataset with ground-truth answers. They keep pairs whose answer is right; for problems answered wrongly they 'rationalize': give the model the right answer and ask for a rationale that reaches it. They fine-tune on the kept and rationalized pairs, then iterate. After two STaR rounds the model's reasoning capability climbs without any human-written rationales.

Solution

Therefore:

Prompt the base model with CoT to generate rationale + answer pairs. Keep the pairs whose answer matches ground truth. **Rationalization**: when a generated rationale yields the wrong answer, prompt the model again with the correct answer as a hint and ask for a rationale that justifies it; add the rationalized example to the training set if it reaches the correct answer. Fine-tune on the kept + rationalized pairs. Repeat: the fine-tuned model generates the rationales for the next round.
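
The loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `generate` and `fine_tune` are hypothetical callables supplied by the caller (they stand in for whatever sampling and training stack the team uses), and exact string equality stands in for the answer filter.

```python
from typing import Callable, List, Optional, Tuple

Example = Tuple[str, str, str]   # (question, rationale, answer)
Problem = Tuple[str, str]        # (question, ground-truth answer)

# Hypothetical signatures, not a real library API:
#   generate(model, question, hint) -> (rationale, answer); hint is None or the gold answer
#   fine_tune(model, examples)      -> a new model
Generate = Callable[[object, str, Optional[str]], Tuple[str, str]]
FineTune = Callable[[object, List[Example]], object]


def star_round(model, dataset: List[Problem], generate: Generate) -> List[Example]:
    """Collect kept and rationalized examples for one round."""
    examples: List[Example] = []
    for question, gold in dataset:
        rationale, answer = generate(model, question, None)
        if answer != gold:
            # Rationalization: retry with the correct answer as a hint.
            rationale, answer = generate(model, question, gold)
        if answer == gold:
            # The stored example carries no hint, only question, rationale, answer.
            examples.append((question, rationale, answer))
    return examples


def star(base_model, dataset: List[Problem], generate: Generate,
         fine_tune: FineTune, rounds: int = 2):
    """Run several rounds of generate, filter, rationalize, fine-tune."""
    model = base_model
    for _ in range(rounds):
        examples = star_round(model, dataset, generate)
        if not examples:
            break  # nothing passed the filter, so there is nothing to train on
        # Assumption: each round retrains from the base checkpoint on data
        # produced by the latest model; continuing from `model` is the alternative.
        model = fine_tune(base_model, examples)
    return model
```

The latest model always produces the data for the next round; which checkpoint the fine-tuning starts from is a design choice the sketch makes explicit in the comment.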

What this pattern forbids. Training data is restricted to filter-passing rationales; ungrounded rationales are not reinforced.

The smaller patterns that complete this one —

  • uses Chain of Thought ★★: Elicit multi-step reasoning by prompting the model to produce intermediate steps before its final answer.

And the patterns that stand alongside it, or against it —

  • complements Self-Consistency ★★: Sample the same question multiple times at non-zero temperature and aggregate by majority or judge to mitigate hallucination.

References

Provenance