Methodology · Fine-Tuning · emerging · verified

Human-Feedback Alignment With DPO

Also known as: direct preference optimization, DPO alignment

Applies to: llm

Tags: dpo, preference-alignment, rlhf-alternative, post-sft

Tune a model toward what people prefer, using pairs of answers where one is marked better than the other. The method is Direct Preference Optimization (DPO): you show the model many (chosen, rejected) pairs and train it to favour the chosen ones. The older way to do this involved several moving parts: train a separate reward model, then run a reinforcement-learning loop on top of it. DPO replaces all of that with one training step that looks much like ordinary supervised training. It is cheaper to run, easier to debug, and reuses the training setup from the earlier supervised fine-tuning (SFT). You run it after the model can already do the task, to shape tone, helpfulness, and when it should refuse. You do not use it to teach new skills.

Methodology process overview

flowchart LR
  sft[SFT checkpoint] --> ref[Freeze copy as reference]
  sft --> tr[DPO trainer]
  pairs[Preference pairs:\n(prompt, chosen, rejected)] --> tr
  ref -->|KL anchor| tr
  tr --> aligned[DPO-aligned checkpoint]
  aligned --> eval[Held-out preference set]
  ref --> eval
  eval --> judge[Human or strong-LLM judge]
  judge --> wr[Win-rate vs reference]
  wr -->|win-rate > baseline| ship[Ship aligned]
  wr -->|no improvement| diag[Investigate preference data]
  diag --> pairs

Intent. Shape a model toward human preferences with one supervised-style training step on chosen and rejected answers, skipping the operational weight of the older reinforcement-learning approach.

When to apply. Use this after the first fine-tuning step, when the model can do the task but its answers are off in style, too long, unsafe, or otherwise not what you want. You also need a set of (chosen, rejected) answer pairs, gathered from human raters or from comparisons against a stronger model. Don't apply it when the model cannot yet do the task: this method shapes preferences and does not teach skills, so go back and fine-tune first. It is also not a drop-in swap when you genuinely need a separate reward model, as in exploratory reinforcement-learning research.

Example scenario

A team already has a fine-tuned checkpoint for a domain assistant. It handles the underlying Q&A task well, but it is too wordy, sometimes over-apologetic, and inconsistent in tone. Rather than run the older approach with a separate reward model and a reinforcement-learning loop, they apply DPO. Annotators get 4,000 prompts and see two answers for each, one from the fine-tuned model and one rewritten by a stronger model or a human editor. They pick the better answer, which produces 4,000 (prompt, chosen, rejected) triples.

The fine-tuned checkpoint is snapshotted as the frozen reference, the anchor that keeps training in check. They run DPO on the same setup as the earlier fine-tuning, with no separate reward model and no reinforcement-learning loop, and produce an aligned checkpoint. They evaluate by win-rate on a held-out preference set of 500 prompts. A strong external model judge compares aligned answers against reference answers, one pair at a time, and reports 64% aligned-wins, 22% reference-wins, and 14% ties. Before shipping, the team checks a smaller human-graded sample to confirm that the model judge's ranking tracks the human ranking.

Importantly, they did not apply DPO straight to the base model. Early experiments showed that DPO cannot teach a skill the fine-tuning step never installed. Skipping fine-tuning collapsed the model. DPO shapes how an existing skill comes out, not what the model knows how to do.
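The human spot-check in the scenario, confirming that the model judge tracks human rankings, can be sketched as a simple agreement rate. This is a minimal illustration, not the team's actual tooling: the function name, the verdict labels, and the five sample verdicts are all made up for the example.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of prompts where the model judge and the human grader agree.

    Both arguments are equal-length lists of verdicts for the same prompts,
    in the same order. The label strings are illustrative.
    """
    if len(judge_labels) != len(human_labels):
        raise ValueError("judge and human label lists must have the same length")
    if not judge_labels:
        raise ValueError("no labels to compare")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Hypothetical verdicts on a small human-graded sample (not real data).
judge = ["aligned", "aligned", "reference", "tie", "aligned"]
human = ["aligned", "reference", "reference", "tie", "aligned"]
rate = agreement_rate(judge, human)  # 4 of the 5 verdicts match
```

A low agreement rate would suggest the judge-reported win-rate should not be trusted as the headline number until the judge or its prompt is revisited.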

Inputs

SFT checkpoint — A fine-tuned model that can already do the task. This method refines its preferences, not its skills.
Preference dataset — Triples of (prompt, chosen answer, rejected answer), where the chosen answer is the better one on the quality you care about.
Reference model — A frozen copy of the fine-tuned model. It acts as an anchor so the trained model is not allowed to drift too far from it (the precise name is a KL anchor).
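One way to represent the preference dataset is as a list of typed triples with a basic sanity check before training. The sketch below is illustrative only: the class name, field names, and checks are assumptions, not a format required by any particular trainer, and the two sample pairs are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One (prompt, chosen, rejected) triple. Field names are illustrative."""
    prompt: str
    chosen: str
    rejected: str


def find_problems(pair: PreferencePair) -> list[str]:
    """Return human-readable problems; an empty list means the pair looks usable."""
    problems = []
    if not pair.prompt.strip():
        problems.append("empty prompt")
    if not pair.chosen.strip() or not pair.rejected.strip():
        problems.append("empty answer")
    if pair.chosen.strip() == pair.rejected.strip():
        problems.append("chosen and rejected are identical, so there is no preference signal")
    return problems


# Invented sample data: the second pair carries no preference signal.
pairs = [
    PreferencePair(
        "How do I reset my password?",
        "Open Settings, choose Security, then select Reset password.",
        "I'm so sorry for the trouble! Apologies again. There are many ways...",
    ),
    PreferencePair("What is the refund window?", "30 days.", "30 days."),
]
usable = [p for p in pairs if not find_problems(p)]
```

Filtering out degenerate pairs early matters because the training signal is purely relative: a pair whose two answers are the same cannot tell the model which direction to move.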

Outputs

Aligned model checkpoint — The trained model. Its answers lean toward the chosen ones while still staying close to the reference.
Preference-win-rate evaluation — A score on a held-out preference set. It measures how often the trained model is preferred over the starting fine-tuned model.
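The win-rate evaluation amounts to tallying judge verdicts over the held-out set. The following is a rough sketch under stated assumptions: the function name and verdict labels are hypothetical, and reporting an extra ties-excluded figure is one possible convention rather than something this methodology prescribes. The counts reproduce the percentages from the example scenario (500 prompts; 64%, 22%, 14%).

```python
from collections import Counter


def win_rate_report(verdicts):
    """Summarise judge verdicts as fractions of the held-out set.

    Each verdict is one of "aligned", "reference", or "tie" (illustrative labels).
    """
    counts = Counter(verdicts)
    unknown = set(counts) - {"aligned", "reference", "tie"}
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    total = len(verdicts)
    if total == 0:
        raise ValueError("no verdicts to summarise")
    decided = counts["aligned"] + counts["reference"]
    return {
        "aligned": counts["aligned"] / total,
        "reference": counts["reference"] / total,
        "tie": counts["tie"] / total,
        # Win-rate with ties excluded; None when every comparison was a tie.
        "aligned_excluding_ties": counts["aligned"] / decided if decided else None,
    }


# Counts matching the example scenario: 320 aligned-wins, 110 reference-wins, 70 ties.
verdicts = ["aligned"] * 320 + ["reference"] * 110 + ["tie"] * 70
report = win_rate_report(verdicts)
```

Whichever convention is chosen for ties, it should be stated next to the headline number, since excluding ties raises the aligned share from 64% to roughly 74% on these counts.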

Steps (5)

1. Start from an SFT'd model
Check that the model can already do the task. This method shapes style and preference, not skill. Running it on a base model that was never fine-tuned wastes the signal.
2. Assemble preference pairs
For each prompt, get a (chosen, rejected) pair. The source can be human raters, comparisons against a stronger model, or curated edits. What matters is which answer is better, not how good either answer is on its own.
3. Freeze a reference model
Take a snapshot of the fine-tuned model as the reference. Training is anchored to this reference, so the model is kept from drifting too far away (the anchor is a KL term).
4. Train with the DPO objective
Train on the preference pairs against the reference. It is one supervised-style training loop on the same setup as the earlier fine-tuning. There is no separate reward model and no reinforcement-learning loop.
5. Evaluate on a held-out preference set
On prompts the model has not seen, generate from both the trained model and the reference. Have a judge, a person or a strong model, pick the better answer. Report the win-rate as the headline number.
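To make the training step above concrete, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python so that it runs without a deep-learning framework. It assumes you already have the summed token log-probability of each answer under both the trained model and the frozen reference. The function and parameter names are hypothetical, and `beta=0.1` is only an illustrative default.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Inputs are log-probabilities of the full answers given the prompt.
    `beta` scales how strongly the loss reacts to movement away from the reference.
    """
    # How far the trained model has moved from the reference on each answer.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)), computed as softplus(-margin) in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# At the start of training the model equals the reference, so the margin is zero
# and the loss is log(2).
start_loss = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)

# Hypothetical later state: chosen became more likely, rejected less likely.
later_loss = dpo_pair_loss(-10.0, -18.0, -12.0, -15.0)
```

In a real run the same expression would be averaged over a batch and differentiated by the training framework; only the trained model receives gradients, while the reference log-probabilities are treated as constants.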

Principles

This method shapes preferences on top of a skill the model already has. It is not a replacement for fine-tuning.
The signal is which answer is better, not how good either answer is on its own.
The reference model is part of the training. Without that anchor, nothing stops the model from drifting far from its fine-tuned behaviour and degrading.
Judge quality on held-out preferences, not on the training loss. A falling loss does not mean users prefer the answers.
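The role of the reference in these principles can be read off the standard DPO objective, given here as a reference sketch. The symbols are introduced for illustration and do not appear elsewhere on this page: $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ the frozen reference, $x$ the prompt, $y_w$ the chosen answer, $y_l$ the rejected answer, $\mathcal{D}$ the preference dataset, $\sigma$ the logistic function, and $\beta$ a positive coefficient.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Every term is a ratio against the reference, so the loss only measures relative movement on the training pairs. That is why a falling value says nothing by itself about whether answers on unseen prompts are actually preferred.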

Known failure modes (2)

Related patterns (1)

★★ Augmented LLM
Build the foundational agent block as an LLM augmented with retrieval, tools, and memory that the model actively chooses to use, rather than a bare-model call.

Related methodologies (2)

Sources (2)

Provenance

Added to catalog: 2026-05-24
Last updated: 2026-05-27
Verification status: verified
