I · Reasoning · Emerging

ReST-EM

also known as Reinforced Self-Training, Self-Training Loop

Iterate generate → reward-filter → fine-tune to bootstrap reasoning capabilities without human-labelled data.

Context

A team wants to improve a model's performance on a reasoning task where the model is already partially competent — it gets some answers right with chain-of-thought — and where there is an automatic way to tell a right answer from a wrong one. This automatic check might be a ground-truth label, an executable test suite, or a formal verifier that says yes or no. The team has compute to spend on generating and filtering many samples, but they do not have human-written rationales or step-by-step solutions to fine-tune on.
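For the executable-test case, the automatic check can be as small as a function that runs a candidate together with its tests and reports pass or fail. The following is a minimal sketch under stated assumptions: the name `passes_tests`, the timeout value, and the convention that tests are plain `assert` statements are all illustrative, and a bare subprocess is not a security sandbox.

```python
import subprocess
import sys


def passes_tests(candidate_code: str, test_code: str, timeout_s: float = 5.0) -> bool:
    """Return True if the candidate plus its tests exit cleanly.

    Hypothetical helper: runs the code in a fresh Python interpreter.
    This isolates crashes and infinite loops from the caller, but it is
    NOT a sandbox and should not be used on untrusted code as-is.
    """
    program = candidate_code + "\n\n" + test_code
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A candidate that never finishes counts as a failure.
        return False
    # A failing assert raises, which gives a non-zero exit code.
    return result.returncode == 0
```

A yes/no signal of this shape is all the loop needs; it never has to explain why a sample failed.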

Problem

Prompting alone has plateaued on the base model. Full reinforcement learning with algorithms such as PPO is unstable and expensive to set up and run. Buying or labelling supervised rationale data at scale is not affordable for this task. The team needs a training loop that bootstraps better reasoning from the model's own outputs using only the reward signal they already have, without human labels and without the volatility of full reinforcement learning.

Forces

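  • Reward filter quality bounds learning quality: a filter that checks only the final answer also passes flawed reasoning that happens to reach it.
  • Iteration count vs cost: each round pays for a full sampling pass plus a fine-tuning run.
  • Distribution drift across iterations: each round trains on the model's own samples, so the training data moves further from the original distribution.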
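One way to handle the iteration-count force is to stop as soon as a held-out score stops improving. The sketch below is illustrative rather than part of the pattern itself: the held-out set, the function name `should_stop`, and the `min_gain` and `patience` defaults are all assumptions.

```python
from typing import Sequence


def should_stop(
    heldout_scores: Sequence[float],
    min_gain: float = 0.005,
    patience: int = 1,
) -> bool:
    """Decide whether another iteration is worth its cost.

    heldout_scores[i] is the held-out accuracy after iteration i.
    Returns True when none of the last `patience` iterations beat the
    best earlier score by at least `min_gain`.
    """
    if len(heldout_scores) <= patience:
        # Not enough history to judge yet.
        return False
    best_before = max(heldout_scores[:-patience])
    best_recent = max(heldout_scores[-patience:])
    return best_recent < best_before + min_gain
```

Scoring on problems the loop never samples from also gives an early warning when the model starts to overfit its own filtered outputs.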

Example

A team wants a small in-house model to solve grade-school math without paying to label rationales. They run ReST-EM: sample many CoT solutions per problem, keep only those whose final answer matches ground truth, fine-tune on the kept set, then sample again. Each round yields a stronger sampler, so the fraction of samples that pass the filter grows. After three iterations the small model lands within a few points of a much larger model's zero-shot baseline at a fraction of the inference cost.
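The "final answer matches ground truth" filter in this example could look roughly like the sketch below. It assumes the final answer is the last number that appears in the solution text, which is a simplification; the names `extract_final_answer` and `is_correct` are hypothetical.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# A number with optional sign, thousands separators, and decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_answer(solution: str) -> Optional[Decimal]:
    """Return the last number in the text, or None if there is none.

    Assumption: the model states its final answer last.
    """
    matches = _NUMBER.findall(solution)
    if not matches:
        return None
    try:
        return Decimal(matches[-1].replace(",", ""))
    except InvalidOperation:
        return None


def is_correct(solution: str, ground_truth: str) -> bool:
    """Binary reward: True only when the final answer equals the label."""
    predicted = extract_final_answer(solution)
    if predicted is None:
        return False
    return predicted == Decimal(ground_truth.replace(",", ""))
```

Comparing as `Decimal` rather than as strings means "12.50" and "12.5" count as the same answer. The filter still says nothing about whether the reasoning that produced the number was sound.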

Solution

Therefore:

Run an EM-style loop. In the E-step, generate many responses per problem and filter them by reward (correctness against ground truth or an executable test). In the M-step, fine-tune on the filtered set. Repeat for several iterations. Variants: ReST (DeepMind, framed as reinforcement learning) and ReST-EM (Singh et al., framed as expectation-maximisation).
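The loop can be sketched as a function that takes the sampler, the reward check, and the trainer as arguments. Everything named here (`rest_em`, `sample`, `reward`, `fine_tune`, `max_kept_per_problem`) is hypothetical, and two design choices are assumptions rather than requirements of the pattern: each round fine-tunes from the base policy instead of the previous checkpoint, and the number of kept samples per problem is capped so that easy problems do not dominate the training set.

```python
from typing import Any, Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (problem, response)


def rest_em(
    base_policy: Any,
    problems: Sequence[str],
    sample: Callable[[Any, str, int], List[str]],
    reward: Callable[[str, str], float],
    fine_tune: Callable[[Any, List[Example]], Any],
    iterations: int = 3,
    samples_per_problem: int = 32,
    max_kept_per_problem: int = 10,
) -> Tuple[Any, List[int]]:
    """Generate, filter by reward, fine-tune, repeat.

    Returns the final policy and the kept-set size for each iteration.
    """
    policy = base_policy
    kept_sizes: List[int] = []
    for _ in range(iterations):
        # E-step: sample from the current policy and keep passing samples.
        dataset: List[Example] = []
        for problem in problems:
            candidates = sample(policy, problem, samples_per_problem)
            passing = [r for r in candidates if reward(problem, r) >= 1.0]
            # Drop exact duplicates (order preserved), then apply the cap.
            unique = list(dict.fromkeys(passing))[:max_kept_per_problem]
            dataset.extend((problem, r) for r in unique)
        kept_sizes.append(len(dataset))
        if not dataset:
            break  # nothing passed the filter, so there is nothing to train on
        # M-step: fine-tune on the filtered set (from the base policy here).
        policy = fine_tune(base_policy, dataset)
    return policy, kept_sizes
```

Only samples that pass `reward` ever reach `fine_tune`, so the trainer never sees a response the filter rejected.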

What this pattern forbids. Training data is restricted to filter-passing samples; ungrounded samples are not reinforced.

The smaller patterns that complete this one —

  • generalises STaR Bootstrapping: Bootstrap a model's reasoning by training it on its own correct chain-of-thought outputs.
  • uses Best-of-N Sampling: Sample N candidate outputs and select the highest-ranked by a reward model or scorer.
