Verification & Reflection

Process Reward Model

Train a verifier that scores each reasoning step rather than only the final answer.

Problem

Outcome-only scoring cannot distinguish reasoning that reached the right answer through valid steps from reasoning that reached it through lucky shortcuts, errors that cancel out, or fabricated intermediate facts. Reinforcing on outcome alone rewards those shortcuts, so the model grows more confident in chains of thought that contain wrong intermediate steps. On harder problems where no shortcut exists, the same kinds of wrong steps lead to wrong final answers. The team needs a feedback signal that can reject a candidate because step three is wrong, even when step five happens to land on the right number.

Solution

Collect step-level labels (correct / neutral / incorrect / hallucination) for chain-of-thought traces and train a classifier to predict them. At inference, score every step and reject candidates whose intermediate steps score low. The step scores can then drive test-time search and fine-tuning of the generator.
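
The inference-time part of this recipe can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the trained classifier is replaced by a hypothetical toy_scorer that only checks simple integer arithmetic, and the names StepScore, score_trace, first_bad_step, accept and the 0.5 threshold are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical scorer signature (an assumption, not a real library API):
# (problem, earlier steps, current step) -> probability that the step is sound.
StepScorer = Callable[[str, List[str], str], float]


@dataclass
class StepScore:
    step: str
    p_correct: float


# Matches one integer equation such as "12 + 5 = 17".
_EQUATION = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")


def toy_scorer(problem: str, context: List[str], step: str) -> float:
    """Stand-in for a trained verifier: checks one integer equation per step."""
    match = _EQUATION.search(step)
    if match is None:
        return 0.5  # nothing checkable, so treat the step as neutral
    left, op, right, claimed = match.groups()
    a, b = int(left), int(right)
    actual = {"+": a + b, "-": a - b, "*": a * b}[op]
    return 1.0 if actual == int(claimed) else 0.0


def score_trace(problem: str, steps: List[str], scorer: StepScorer) -> List[StepScore]:
    """Score every step, giving the scorer the steps that came before it."""
    return [StepScore(s, scorer(problem, steps[:i], s)) for i, s in enumerate(steps)]


def first_bad_step(scores: List[StepScore], threshold: float = 0.5) -> Optional[int]:
    """Index of the first step scoring below the threshold, or None."""
    for i, s in enumerate(scores):
        if s.p_correct < threshold:
            return i
    return None


def accept(scores: List[StepScore], threshold: float = 0.5) -> bool:
    """Reject the whole candidate if any intermediate step scores low."""
    return first_bad_step(scores, threshold) is None


problem = "Compute 3 * 4 + 5."
sound = ["3 * 4 = 12", "12 + 5 = 17"]
# Reaches 17 as well, but through a wrong step and a made-up correction.
lucky = ["3 * 4 = 12", "12 + 5 = 18", "18 - 1 = 17"]

sound_scores = score_trace(problem, sound, toy_scorer)
lucky_scores = score_trace(problem, lucky, toy_scorer)
print(accept(sound_scores), accept(lucky_scores), first_bad_step(lucky_scores))
```

Both traces end on the same number, so an outcome-only check would treat them alike. The step-level check rejects the second trace and points at the step that caused the rejection.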

When to use

  • Outcome-only reward reinforces shortcut reasoning that lands on the right answer the wrong way.
  • Step-level labels (correct, neutral, incorrect, hallucination) can be collected at scale.
  • Test-time search or fine-tuning can consume step-level scores.
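
As one illustration of how test-time search might consume step-level scores, the sketch below reranks several sampled candidates by an aggregate of their per-step scores. The aggregation rules (minimum or product over steps), the function names aggregate and best_of_n, and the example scores are assumptions chosen for the illustration; the text above does not prescribe a particular rule.

```python
from math import prod
from typing import Dict, Sequence


def aggregate(step_scores: Sequence[float], how: str = "min") -> float:
    """Collapse per-step scores into one candidate-level score.

    "min" lets a single bad step sink the candidate; "prod" treats the
    scores as independent probabilities that every step is sound.
    """
    if not step_scores:
        return 0.0  # assumption: a candidate with no scored steps ranks last
    if how == "min":
        return min(step_scores)
    if how == "prod":
        return prod(step_scores)
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(candidates: Dict[str, Sequence[float]], how: str = "min") -> str:
    """Return the name of the candidate with the highest aggregate score."""
    return max(candidates, key=lambda name: aggregate(candidates[name], how))


# Hypothetical per-step scores for three sampled solutions to one problem.
candidates = {
    "A": [0.9, 0.2, 0.95],  # one weak intermediate step
    "B": [0.8, 0.7, 0.75],  # uniformly plausible steps
    "C": [0.6, 0.6],
}

print(best_of_n(candidates, how="min"))
print(best_of_n(candidates, how="prod"))
```

With these numbers candidate A is passed over under either rule, because its second step scores low even though its other steps score highest. The same aggregate could be used to filter traces before fine-tuning the generator.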

