Human-Feedback Alignment With DPO
Also known as: Direct Preference Optimization, DPO alignment
Tune a model toward what people prefer, using pairs of answers in which one is marked better than the other. You show the model many (chosen, rejected) pairs and train it to favour the chosen ones; the method is called Direct Preference Optimization (DPO). The older approach, reinforcement learning from human feedback (RLHF), involved several moving parts: train a separate reward model, then run a reinforcement-learning loop on top of it. DPO replaces both with one training step that looks much like ordinary supervised training. It is cheaper to run, easier to debug, and reuses the setup from the supervised fine-tuning (SFT) stage. You run it after the model can already do the task, to shape tone, helpfulness, and when it should refuse. You do not use it to teach new skills.
Intent. Shape a model toward human preferences with one supervised-style training step on chosen and rejected answers, skipping the operational weight of the older reinforcement-learning approach.
When to apply. Use this after the first fine-tuning step, when the model can do the task but its answers are off in style, too long, unsafe, or otherwise not what you want. You also need a set of (chosen, rejected) answer pairs, gathered from human raters or from comparisons against a stronger model. Don't apply it when the model cannot really do the task yet: this method shapes preferences and does not teach skills, so go back and fine-tune first. Also skip it if you genuinely need a separate reward model, such as in exploratory reinforcement-learning research. DPO never produces one, so it is not a drop-in swap.
Inputs
- SFT checkpoint — A fine-tuned model that can already do the task. This method refines its preferences, not its skills.
- Preference dataset — Triples of (prompt, chosen answer, rejected answer), where the chosen answer is the better one on the quality you care about.
- Reference model — A frozen copy of the fine-tuned model. It acts as an anchor so the trained model is not allowed to drift too far from it (the precise name is a KL anchor).
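As a rough illustration of the preference-dataset input, the sketch below stores each (prompt, chosen answer, rejected answer) triple as a small record and writes the set out as JSON Lines. The class name, the field names, and the choice of JSONL are assumptions made for this example, not a format the method requires.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PreferenceExample:
    """One (prompt, chosen, rejected) triple. Names here are illustrative."""
    prompt: str
    chosen: str
    rejected: str

    def validate(self) -> None:
        # Empty fields or identical answers carry no preference signal.
        if not (self.prompt.strip() and self.chosen.strip() and self.rejected.strip()):
            raise ValueError("prompt, chosen and rejected must all be non-empty")
        if self.chosen == self.rejected:
            raise ValueError("chosen and rejected answers must differ")


def to_jsonl(examples: list[PreferenceExample]) -> str:
    """Serialize validated examples as one JSON object per line."""
    lines = []
    for example in examples:
        example.validate()
        lines.append(json.dumps(asdict(example), ensure_ascii=False))
    return "\n".join(lines)


example = PreferenceExample(
    prompt="Summarize the meeting in one sentence.",
    chosen="The team agreed to ship the release on Friday.",
    rejected="There was a meeting and people talked about several things.",
)
jsonl = to_jsonl([example])
```

Validating at write time keeps degenerate pairs, such as two identical answers, out of training, where they would contribute nothing to the preference signal.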
Outputs
- Aligned model checkpoint — The trained model. Its answers lean toward the chosen ones while still staying close to the reference.
- Preference-win-rate evaluation — A score on a held-out preference set. It measures how often the trained model is preferred over the starting fine-tuned model.
Steps (5)
Start from an SFT'd model
Check that the model can already do the task. This method shapes style and preference, not skill. Running it on a base model that was never fine-tuned wastes the signal.
Assemble preference pairs
For each prompt, get a (chosen, rejected) pair. The source can be human raters, comparisons against a stronger model, or curated edits. What matters is which answer is better, not how good either answer is on its own.
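One way this step could look when the source is human raters: the sketch below turns side-by-side votes into (chosen, rejected) pairs by majority and drops comparisons with no clear winner. The input schema (`answer_a`, `answer_b`, `votes`) and the majority rule are assumptions for illustration; note that only the relative judgment is kept, never an absolute score.

```python
from collections import Counter


def pairs_from_votes(comparisons):
    """Turn rater votes on (answer_a, answer_b) into preference pairs.

    Each comparison is a dict with keys "prompt", "answer_a", "answer_b"
    and "votes", where votes is a list of "a" or "b". Hypothetical schema.
    """
    pairs = []
    for item in comparisons:
        counts = Counter(item["votes"])
        if counts["a"] == counts["b"]:
            continue  # no majority (or no votes at all): skip this comparison
        a_wins = counts["a"] > counts["b"]
        pairs.append({
            "prompt": item["prompt"],
            "chosen": item["answer_a"] if a_wins else item["answer_b"],
            "rejected": item["answer_b"] if a_wins else item["answer_a"],
        })
    return pairs


comparisons = [
    {"prompt": "Explain recursion briefly.",
     "answer_a": "Recursion is when a function calls itself on a smaller input.",
     "answer_b": "Recursion is recursion.",
     "votes": ["a", "a", "b"]},
    {"prompt": "Say hello.",
     "answer_a": "Hello!",
     "answer_b": "Hi there!",
     "votes": ["a", "b"]},  # tied vote, so this comparison is dropped
]
pairs = pairs_from_votes(comparisons)
```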
Freeze a reference model
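Take a snapshot of the fine-tuned model and freeze its weights; this is the reference. Training is anchored to it, so the trained model cannot drift too far away. The anchor works as a KL-divergence penalty against the reference, and the DPO hyperparameter β sets how strong it is.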
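A minimal sketch of freezing a reference, assuming PyTorch and using a tiny `nn.Linear` as a stand-in for the fine-tuned model (a real run would load the SFT checkpoint instead). The variable names are illustrative.

```python
import copy

import torch
from torch import nn

# Stand-in for the SFT checkpoint; a real run would load the fine-tuned LLM.
policy = nn.Linear(4, 2)

# Snapshot the weights, then freeze the copy so it never receives updates.
reference = copy.deepcopy(policy)
reference.eval()
reference.requires_grad_(False)

# Only the policy's parameters are handed to the optimizer.
optimizer = torch.optim.SGD(policy.parameters(), lr=0.1)

# One dummy update on the policy to show the reference stays put.
reference_before = copy.deepcopy(reference.state_dict())
loss = policy(torch.ones(1, 4)).sum()
loss.backward()
optimizer.step()

reference_unchanged = all(
    torch.equal(reference_before[name], tensor)
    for name, tensor in reference.state_dict().items()
)
policy_changed = not torch.equal(policy.weight, reference.weight)
```

Copying before any preference training starts matters: a snapshot taken later would anchor the model to weights that have already moved.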
Train with the DPO objective
Train on the preference pairs against the reference. It is one supervised-style training loop on the same setup as the earlier fine-tuning. There is no separate reward model and no reinforcement-learning loop.
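To make the objective concrete, here is a sketch of the standard DPO loss for a single preference pair in plain Python. It assumes you already have the summed token log-probability of each answer under the trained model and under the frozen reference; the function names and the default `beta=0.1` are illustrative choices, not values taken from the source.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair, from summed log-probabilities of each answer."""
    # How much more (or less) likely each answer is under the policy
    # than under the frozen reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the policy favours the chosen answer more than
    # the reference does; beta scales how sharply.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# Policy identical to the reference: margin is 0, loss is log(2).
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy moved toward the chosen answer: loss is lower.
better = dpo_loss(-10.0, -17.0, -12.0, -15.0)
# Policy moved toward the rejected answer: loss is higher.
worse = dpo_loss(-14.0, -13.0, -12.0, -15.0)
```

In a batched training loop the same expression is averaged over pairs and backpropagated through the policy only; the reference log-probabilities are computed without gradients.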
Evaluate on a held-out preference set
On prompts the model has not seen, generate from both the trained model and the reference. Have a judge, a person or a strong model, pick the better answer. Report the win-rate as the headline number.
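The headline number can be computed from the judge's verdicts with a few lines. In this sketch each verdict is one of the strings "trained", "reference", or "tie", and a tie counts as half a win; both the labels and the tie convention are assumptions, so report whichever convention you use alongside the score.

```python
def win_rate(verdicts, tie_weight=0.5):
    """Share of held-out prompts where the trained model was preferred.

    verdicts: list of "trained", "reference", or "tie" (illustrative labels).
    tie_weight: credit given to the trained model for a tie (assumed 0.5).
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    allowed = {"trained", "reference", "tie"}
    unknown = set(verdicts) - allowed
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    wins = verdicts.count("trained")
    ties = verdicts.count("tie")
    return (wins + tie_weight * ties) / len(verdicts)


verdicts = ["trained", "trained", "reference", "tie"]
rate = win_rate(verdicts)  # (2 + 0.5 * 1) / 4 = 0.625
```

A rate near 0.5 means the judge cannot tell the trained model from the reference, however far the training loss fell.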
Principles
- This method shapes preferences on top of a skill the model already has. It is not a replacement for fine-tuning.
- The signal is which answer is better, not how good either answer is on its own.
- The reference model is part of the training. Without that anchor, the model can over-optimize the preference signal and degrade, for example by losing fluency or general ability it had after fine-tuning.
- Judge quality on held-out preferences, not on the training loss. A falling loss does not mean users prefer the answers.
Related methodologies (2)
- SFT Then DPO Fine-tuning Workflow★
Take an open-weight base to a production-ready, well-behaved assistant in two clear stages, each with its own data and goal, sharing one training pipeline.
- Instruction Fine-tune Then Judge Cycle★★
Iterate on instruction fine-tunes using one signal, a model-graded score on the test set, while keeping training fit and answer quality as separate readings.
Sources (2)
Build a Large Language Model (From Scratch) — Sebastian Raschka
Ch 7 bonus / Appendix (DPO implementation) “Use human feedback to ensure your LLM follows instructions ... Bonus: Direct Preference Optimization (DPO) implementation”
rasbt/LLMs-from-scratch — ch07/04_preference-tuning-with-dpo (DPO-from-scratch bonus material)
“Direct Preference Optimization (DPO) for LLM Alignment ... Generating a Preference Dataset With Llama 3.1 70B and Ollama”
Provenance
- Verification status: verified