Process Reward Model
also known as PRM, Step-Level Verifier
Train a verifier that scores each reasoning step rather than only the final answer.
This pattern helps complete certain larger patterns —
- specialises Test-Time Compute Scaling ★★: Allocate more inference-time compute (samples, search, deeper thinking) instead of scaling parameters to improve quality.
Context
A team trains or evaluates a model on multi-step reasoning tasks such as mathematics word problems, multi-hop question answering, or chains of logical deduction. The model produces a chain of intermediate steps and a final answer, and the team has been training or selecting candidates using an outcome reward model (a verifier that only scores whether the final answer is right). They also have, or could collect, human labels at the level of individual reasoning steps.
Problem
Outcome-only scoring cannot distinguish reasoning that reached the right answer correctly from reasoning that reached it through lucky shortcuts, errors that cancel out, or fabricated intermediate facts. Reinforcing on outcome alone rewards those shortcuts, so the model becomes more confident in chains of thought that contain wrong intermediate steps. Later, on harder problems where the shortcut does not exist, the same kinds of wrong intermediate steps lead to wrong final answers. The team needs a feedback signal that can reject a candidate because step three is wrong, even when step five happens to land on the right number.
Forces
- Step-level annotation is expensive (humans must label each step).
- Step boundaries vary across tasks.
- PRM and outcome reward sometimes conflict on what counts as 'correct'.
Example
A maths-reasoning agent passes most of the eval set, but on inspection many traces reach correct final answers through wrong intermediate steps: shortcuts that the outcome reward model rewarded. The team trains a process reward model: human raters label each chain-of-thought step as correct, neutral, incorrect, or hallucinated, and a classifier learns to predict step-level scores from those labels. At inference, candidates whose intermediate steps score low are rejected even when the final answer happens to match. The agent's reasoning quality improves, not just its final accuracy.
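The labelling step in this example can be pictured as turning each rated trace into one training row per step. The following is a minimal sketch under stated assumptions: the `LabelledTrace` record, the `to_training_rows` helper, and the row layout are hypothetical names chosen for illustration, not part of any particular library or of the pattern itself.

```python
from dataclasses import dataclass

# Label set taken from the example above; the tuple order is an arbitrary choice.
STEP_LABELS = ("correct", "neutral", "incorrect", "hallucinated")


@dataclass
class LabelledTrace:
    """One rated chain of thought (hypothetical record layout)."""
    problem: str
    steps: list[str]
    labels: list[str]  # one rater label per step


def to_training_rows(trace: LabelledTrace) -> list[dict]:
    """Emit one row per step: the context so far, the step, and its label id."""
    if len(trace.steps) != len(trace.labels):
        raise ValueError("every step needs exactly one label")
    rows = []
    for i, (step, label) in enumerate(zip(trace.steps, trace.labels)):
        if label not in STEP_LABELS:
            raise ValueError(f"unknown label: {label!r}")
        rows.append({
            "context": [trace.problem, *trace.steps[:i]],  # problem plus earlier steps
            "step": step,
            "label_id": STEP_LABELS.index(label),
        })
    return rows
```

Each row pairs a step with the problem and the steps before it, so a classifier trained on such rows would judge a step in context rather than in isolation.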
Solution
Therefore:
Collect step-level labels (correct / neutral / incorrect / hallucinated) for chain-of-thought traces. Train a classifier to predict step labels. At inference, score every step and reject candidates whose intermediate steps score low. The same step scores can drive test-time search and fine-tuning of the generator.
What this pattern forbids. Accepting a final answer whose intermediate steps fall below the PRM threshold, even when the answer itself is correct.
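The acceptance rule can be sketched as a gate over per-step scores. This is an illustrative sketch only: the `StepScorer` interface, the `accept` helper, and the 0.5 default threshold are assumptions, and `toy_scorer` is a hand-written stand-in for a trained PRM that only checks simple arithmetic steps.

```python
from typing import Callable, Sequence

# A step scorer maps (problem, earlier steps, current step) to a score in [0, 1].
# In practice this would wrap the trained PRM; here it is a hypothetical interface.
StepScorer = Callable[[str, Sequence[str], str], float]


def score_steps(problem: str, steps: Sequence[str], scorer: StepScorer) -> list[float]:
    """Score every step given the problem and the steps that precede it."""
    return [scorer(problem, steps[:i], step) for i, step in enumerate(steps)]


def accept(problem: str, steps: Sequence[str], scorer: StepScorer,
           threshold: float = 0.5) -> bool:
    """Accept a candidate only if every step meets the (assumed) threshold."""
    scores = score_steps(problem, steps, scorer)
    return bool(scores) and min(scores) >= threshold


def toy_scorer(problem: str, previous: Sequence[str], step: str) -> float:
    """Stand-in for a trained PRM: checks simple 'a op b = c' arithmetic steps."""
    lhs, _, rhs = step.partition("=")
    try:
        a, op, b = lhs.split()
        ok = {"+": int(a) + int(b), "*": int(a) * int(b)}[op] == int(rhs)
    except (ValueError, KeyError):
        return 0.5  # cannot judge the step: neutral score
    return 1.0 if ok else 0.0


problem = "2 + 3 * 4"
sound = ["3 * 4 = 12", "2 + 12 = 14"]
lucky = ["3 * 4 = 12", "2 + 12 = 15", "so the answer is 14"]
print(accept(problem, sound, toy_scorer))  # every step checks out
print(accept(problem, lucky, toy_scorer))  # step two is wrong, so the gate rejects
```

The `lucky` trace lands on the right final number, yet the gate rejects it because its second step scores zero, which is the behaviour an outcome-only verifier cannot provide.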
The smaller patterns that complete this one —
- uses Best-of-N Sampling ★: Sample N candidate outputs and select the highest-ranked by a reward model or scorer.
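Combining the two patterns, a Best-of-N selector can rank candidates by an aggregate of their step scores. The sketch below assumes the per-step scores have already been produced by a PRM; the `aggregate` and `best_of_n` helpers, the choice of minimum or product as aggregation, and the threshold value are illustrative assumptions rather than a prescribed method.

```python
import math
from typing import Optional, Sequence


def aggregate(step_scores: Sequence[float], how: str = "min") -> float:
    """Collapse per-step scores into one candidate score."""
    if not step_scores:
        return 0.0
    if how == "min":
        return min(step_scores)  # a chain is as good as its weakest step
    if how == "prod":
        return math.prod(step_scores)  # penalises every weak step, favours short chains
    raise ValueError(f"unknown aggregation: {how!r}")


def best_of_n(candidates: Sequence[Sequence[float]], how: str = "min",
              threshold: float = 0.5) -> Optional[int]:
    """Return the index of the best candidate, or None if none passes the threshold."""
    scored = [(aggregate(scores, how), i) for i, scores in enumerate(candidates)]
    passing = [(score, i) for score, i in scored if score >= threshold]
    if not passing:
        return None
    return max(passing, key=lambda pair: pair[0])[1]


# Per-step PRM scores for three sampled candidates (made-up numbers).
samples = [[0.9, 0.2, 0.95], [0.8, 0.7, 0.75], [0.6, 0.9, 0.9]]
print(best_of_n(samples))
```

Returning `None` when no candidate clears the threshold keeps the selector from falling back to a chain with a weak step merely because it was the best of a bad batch.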
And the patterns that stand alongside it, or against it —
- complements Language Agent Tree Search ·: Lift the agent loop into a search tree with a learned value function and backtracking.
- complements Adaptive Compute Allocation ★: Allocate inference-time compute (thinking tokens, samples, depth, model size) per query based on input difficulty, rather than using a fixed budget across all queries.
- alternative-to Reward Hacking ✕: Anti-pattern: optimise the agent against a single proxy metric and assume the metric remains a faithful proxy after optimisation pressure.