Human-Feedback Alignment With DPO
Shape a model toward human preferences with one supervised-style training step on chosen and rejected answers, skipping the operational weight of the older reinforcement-learning approach.
Description
Tune a model toward what people prefer, using pairs of answers where one is marked better than the other. The method is Direct Preference Optimization (DPO): you show the model many (chosen, rejected) pairs and train it to favour the chosen ones. The older way to do this had several moving parts: train a separate reward model, then run a reinforcement-learning loop on top of it. DPO replaces both with one training step that looks much like ordinary supervised training. It is cheaper to run, easier to debug, and reuses the training setup from the earlier fine-tuning step. Run it after the model can already do the task, to shape tone, helpfulness, and when it should refuse. It is not a way to teach new skills.
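The single training step mentioned above can be sketched for one pair as follows. This is a minimal illustration, not a reference implementation: it assumes you already have the summed token log-probabilities of each answer under the model being trained (the policy) and under a frozen reference copy. The function names and the default beta of 0.1 are illustrative choices, not values taken from this page.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)), computed as -softplus(-x).
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; controls how far the policy may drift
) -> float:
    """DPO loss for a single (chosen, rejected) pair.

    Each argument is the log-probability of a whole answer given the prompt.
    """
    # How much more (or less) likely each answer is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # The loss falls as the policy raises the chosen answer relative to
    # the rejected one, measured against the reference.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)
```

When the policy is identical to the reference, the margin is zero and the loss equals log 2. Training lowers it by shifting probability toward the chosen answers.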
When to apply
Use this after the first fine-tuning step, when the model can do the task but its answers are off in style, too long, unsafe, or otherwise not what you want. You also need a set of (chosen, rejected) answer pairs, gathered from human raters or from comparisons against a stronger model. Don't apply it when the model cannot really do the task yet: this method shapes preferences and does not teach skills, so go back and fine-tune first. It is also not a drop-in swap when you genuinely need a separate reward model, as in exploratory reinforcement-learning research.
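One possible shape for the (chosen, rejected) pairs described above is sketched here. The `PreferencePair` record and the `usable_pairs` filter are hypothetical names introduced for illustration, and the filtering rules are an assumption about what makes a pair usable, not a requirement stated by this page.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PreferencePair:
    prompt: str    # the input both answers respond to
    chosen: str    # the answer the rater preferred
    rejected: str  # the answer the rater marked worse


def usable_pairs(pairs: Iterable[PreferencePair]) -> List[PreferencePair]:
    """Drop pairs that carry no preference signal."""
    kept = []
    for pair in pairs:
        fields = (pair.prompt.strip(), pair.chosen.strip(), pair.rejected.strip())
        if not all(fields):
            continue  # an empty prompt or answer cannot be compared
        if pair.chosen.strip() == pair.rejected.strip():
            continue  # identical answers express no preference
        kept.append(pair)
    return kept


raw = [
    PreferencePair("Summarise the report.", "Short, accurate summary.", "Rambling summary."),
    PreferencePair("Summarise the report.", "Same text.", "Same text."),
    PreferencePair("Summarise the report.", "", "Some answer."),
]
clean = usable_pairs(raw)  # keeps only the first pair
```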
What it involves
- Start from a supervised fine-tuned (SFT) model
- Assemble preference pairs
- Freeze a reference model
- Train with the DPO objective
- Evaluate on a held-out preference set
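The final evaluation step can be sketched as a preference accuracy: the share of held-out pairs on which the trained model, compared with the frozen reference, favours the chosen answer. This is a minimal sketch under assumptions. Each record is taken to hold four precomputed per-answer log-probabilities, and `preference_accuracy` is a hypothetical helper name.

```python
from typing import Sequence, Tuple

# (policy_chosen_logp, policy_rejected_logp, ref_chosen_logp, ref_rejected_logp)
Record = Tuple[float, float, float, float]


def preference_accuracy(records: Sequence[Record], beta: float = 0.1) -> Tuple[float, float]:
    """Return (accuracy, mean margin) over held-out preference pairs."""
    if not records:
        raise ValueError("need at least one held-out pair")

    wins = 0
    total_margin = 0.0
    for policy_chosen, policy_rejected, ref_chosen, ref_rejected in records:
        # Same margin the DPO objective is trained on.
        margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
        total_margin += margin
        if margin > 0:
            wins += 1  # the model prefers the chosen answer on this pair

    return wins / len(records), total_margin / len(records)


held_out = [
    (-8.0, -14.0, -10.0, -12.0),   # model moved toward the chosen answer
    (-11.0, -11.0, -10.0, -12.0),  # model moved toward the rejected answer
    (-10.0, -12.0, -10.0, -12.0),  # model unchanged from the reference
]
accuracy, mean_margin = preference_accuracy(held_out)
```

A model that has not moved from the reference scores a margin of zero on every pair, so accuracy measured this way only rises once training has shifted preferences in the intended direction.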