VII · Verification & Reflection · Emerging ★

Process Reward Model

also known as PRM, Step-Level Verifier

Train a verifier that scores each reasoning step rather than only the final answer.

This pattern helps complete certain larger patterns —

specialises Test-Time Compute Scaling ★★: Allocate more inference-time compute (samples, search, deeper thinking) instead of scaling parameters to improve quality.

Context

A team trains or evaluates a model on multi-step reasoning tasks such as mathematics word problems, multi-hop question answering, or chains of logical deduction. The model produces a chain of intermediate steps and a final answer, and the team has been training or selecting candidates using an outcome reward model (a verifier that only scores whether the final answer is right). They also have, or could collect, human labels at the level of individual reasoning steps.

Problem

Outcome-only scoring cannot distinguish reasoning that reached the right answer correctly from reasoning that reached it through lucky shortcuts, errors that cancel out, or fabricated intermediate facts. Reinforcement on outcome alone rewards those shortcuts, so the model becomes more confident in chains of thought that contain wrong intermediate steps. Later, on harder problems where the shortcut does not exist, the same kinds of wrong intermediate steps lead to wrong final answers. The team needs a feedback signal that can reject a candidate because step three is wrong, even when step five happens to land on the right number.
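The gap between the two signals can be illustrated with a toy trace in which two arithmetic errors cancel. This is only a sketch: the `Step` type, `outcome_ok`, and `first_bad_step` are hypothetical names, and an exact arithmetic check stands in for a learned verifier.

```python
import operator
from dataclasses import dataclass
from typing import List, Optional

# Supported operations for the toy checker (an assumption of this sketch).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


@dataclass(frozen=True)
class Step:
    """One arithmetic step: `left op right = claimed`."""
    left: int
    op: str
    right: int
    claimed: int


def outcome_ok(steps: List[Step], gold: int) -> bool:
    """Outcome-only check: looks at the final claimed value and nothing else."""
    return steps[-1].claimed == gold


def first_bad_step(steps: List[Step]) -> Optional[int]:
    """Step-level check: 1-based index of the first wrong step, or None."""
    for index, step in enumerate(steps, start=1):
        if OPS[step.op](step.left, step.right) != step.claimed:
            return index
    return None


# Task: (3 + 4) * 2, gold answer 14.
# Both steps are wrong, but the errors cancel and the answer matches.
lucky_trace = [Step(3, "+", 4, 8), Step(8, "*", 2, 14)]
sound_trace = [Step(3, "+", 4, 7), Step(7, "*", 2, 14)]

print(outcome_ok(lucky_trace, 14), first_bad_step(lucky_trace))
print(outcome_ok(sound_trace, 14), first_bad_step(sound_trace))
```

The outcome check accepts both traces, while the step-level check flags the first step of the lucky one. A real process reward model replaces the exact check with a learned score, because most reasoning steps cannot be verified mechanically.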

Forces

Step-level annotation is expensive (humans must label each step).
Step boundaries vary across tasks.
PRM and outcome reward sometimes conflict on what counts as 'correct'.

Example

A maths-reasoning agent passes most of the eval set, but on inspection many traces reach correct final answers through wrong intermediate steps: shortcuts that the outcome reward model rewarded. The team trains a process reward model: human raters label each chain-of-thought step as correct, neutral, incorrect, or hallucinated, and a classifier learns to predict step-level scores from those labels. At inference, candidates whose intermediate steps score low are rejected even when the final answer happens to match. The agent's reasoning quality improves, not just its final accuracy.
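One way to picture the labelled data in this example is as one training record per step, each carrying the steps that preceded it. The sketch below assumes a hypothetical schema (`StepLabel`, `StepExample`, `to_step_examples`); the catalog entry does not prescribe a storage format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence, Tuple


class StepLabel(Enum):
    """The four rater labels named in the example."""
    CORRECT = "correct"
    NEUTRAL = "neutral"
    INCORRECT = "incorrect"
    HALLUCINATED = "hallucinated"


@dataclass(frozen=True)
class StepExample:
    """One classifier training record (hypothetical schema)."""
    problem: str
    previous_steps: Tuple[str, ...]
    step: str
    label: StepLabel


def to_step_examples(
    problem: str, steps: Sequence[str], labels: Sequence[StepLabel]
) -> List[StepExample]:
    """Flatten one labelled trace into per-step records with their prefix."""
    if len(steps) != len(labels):
        raise ValueError("each step needs exactly one label")
    return [
        StepExample(problem, tuple(steps[:i]), steps[i], labels[i])
        for i in range(len(steps))
    ]


examples = to_step_examples(
    "Compute (3 + 4) * 2.",
    ["3 + 4 = 8", "8 * 2 = 14"],
    [StepLabel.INCORRECT, StepLabel.INCORRECT],
)
```

Keeping the prefix with each record lets the classifier judge a step in context, since the same sentence can be correct after one prefix and wrong after another.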

Diagram

flowchart TD
    CoT[Reasoning trace] --> S1[Step 1] --> S2[Step 2] --> S3[Step 3] --> Ans[Final answer]
    S1 --> PRM[Process reward model]
    S2 --> PRM
    S3 --> PRM
    PRM -->|low score| Reject[Reject candidate]
    PRM -->|all high| Accept[Accept]

Solution

Therefore:

Collect step-level labels (correct / neutral / incorrect / hallucination) for chain-of-thought traces. Train a classifier to predict step labels. At inference, score every step and reject candidates whose intermediate steps have low scores. The same step scores can drive test-time search and provide a training signal for fine-tuning the generator.
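The inference-time half of this solution can be sketched as follows, assuming the trained classifier is exposed as a callable that returns a probability that a step is correct given the steps before it. The names `StepScorer`, `score_trace`, `accept`, `solution_score`, and `toy_scorer` are hypothetical, and the 0.5 threshold is an arbitrary placeholder.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: (step, previous_steps) -> P(step is correct).
StepScorer = Callable[[str, Sequence[str]], float]


def score_trace(steps: Sequence[str], score_step: StepScorer) -> List[float]:
    """Score every step, giving the scorer the steps that came before it."""
    return [score_step(step, steps[:i]) for i, step in enumerate(steps)]


def accept(step_scores: Sequence[float], threshold: float = 0.5) -> bool:
    """Accept a candidate only if every step clears the threshold."""
    return all(score >= threshold for score in step_scores)


def solution_score(step_scores: Sequence[float]) -> float:
    """One possible aggregate: the product of the per-step scores."""
    return math.prod(step_scores)


def toy_scorer(step: str, previous_steps: Sequence[str]) -> float:
    """Stand-in for a trained classifier; a real PRM would be a model call."""
    return 0.2 if "= 8" in step else 0.9


scores = score_trace(["3 + 4 = 8", "8 * 2 = 14"], toy_scorer)
print(scores, accept(scores), solution_score(scores))
```

Gating on the weakest step matches the accept/reject rule stated here, while an aggregate such as the product gives a single number that can be used to rank candidates.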

What this pattern forbids. Accepting a final answer when any of its intermediate steps falls below the PRM threshold.

The smaller patterns that complete this one —

uses Best-of-N Sampling ★: Sample N candidate outputs and select the highest-ranked by a reward model or scorer.
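Combined with step-level scores, best-of-N selection might look like the sketch below: drop candidates with any step under the threshold, then rank the survivors. `Candidate` and `best_of_n` are hypothetical names, and ranking by the product of step scores is one choice among several.

```python
import math
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    """A sampled answer with PRM scores for its steps (hypothetical shape)."""
    answer: str
    step_scores: Sequence[float]


def best_of_n(
    candidates: Sequence[Candidate], threshold: float = 0.5
) -> Optional[Candidate]:
    """Return the highest-ranked candidate whose steps all pass, else None."""
    passing = [
        c for c in candidates
        if c.step_scores and min(c.step_scores) >= threshold
    ]
    if not passing:
        return None
    # Rank survivors by the product of their step scores.
    return max(passing, key=lambda c: math.prod(c.step_scores))


samples = [
    Candidate("14", [0.9, 0.3]),   # rejected: second step is below 0.5
    Candidate("14", [0.8, 0.9]),   # product 0.72
    Candidate("12", [0.95, 0.7]),  # product 0.665
]
chosen = best_of_n(samples)
```

Returning `None` when no candidate passes keeps the rejection explicit, so the caller can decide whether to sample again or abstain.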

And the patterns that stand alongside it, or against it —

complements Language Agent Tree Search ·: Lift the agent loop into a search tree with a learned value function and backtracking.
complements Adaptive Compute Allocation ★: Allocate inference-time compute (thinking tokens, samples, depth, model size) per query based on input difficulty, rather than using a fixed budget across all queries.
alternative-to Reward Hacking ✕: Anti-pattern: optimise the agent against a single proxy metric and assume the metric remains a faithful proxy after optimisation pressure.

Used in recipes

Reflection & Self-Correction (optional)

References

Let's Verify Step by Step (paper)

Provenance

Source: patterns/process-reward-model.md on GitHub · commit 4fa1213
Added to catalog: 2026-04-30
Last updated: 2026-05-22
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.