SFT Then DPO Fine-tuning Workflow
Also known as: two-stage alignment, instruct-then-align
A two-stage way to fine-tune an open-weight model. Stage one uses supervised fine-tuning (SFT) on instruction data to teach the model the task and how to follow instructions. Stage two uses Direct Preference Optimization (DPO) on pairs of answers marked better and worse to refine style, helpfulness, and safety. One config file switches between the two stages, so the same setup runs both. The order matters: stage one installs the skill, and stage two shapes how that skill comes out. Skip or swap the stages and you get a model that either cannot do the task or does it in the wrong tone.
Methodology process overview
Intent. Take an open-weight base model to a production-ready, well-behaved assistant in two clear stages, each with its own data and goal, sharing one training pipeline.
When to apply. Use this when you build or customise an open-weight assistant for a domain where both the task skill and a specific tone, safety profile, or interaction style matter. It assumes you control the training stack and have, or can build, both an instruction dataset and a preference dataset. Skip it when an API instruct model already meets your needs, because then the extra pipeline is not worth its cost. The exceptions are regulated deployments that require the weights to run on your own servers, and research into the alignment trade-offs between the two stages.
Inputs
- Open-weight base model — A pretrained base checkpoint ready for instruction fine-tuning.
- Instruction dataset — Curated (instruction, response) pairs. Stage one uses these to install the task and instruction-following.
- Preference dataset — Triples of (prompt, chosen, rejected). Stage two uses these to align style and safety.
- Pipeline configuration — A YAML config that picks the stage (stage one or stage two), the hyperparameters, the dataset paths, and the output directory.
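As an illustration of the two dataset inputs above, here is a minimal Python sketch of one record of each kind, with a small check for preference triples. The field names follow the tuple names in the list; storing them as dictionaries with these keys is an assumption, the example texts are made up, and `validate_preference` is a hypothetical helper, not part of any library.

```python
# Hypothetical record shapes; the key names mirror the tuples named above.
instruction_example = {
    "instruction": "Summarise the ticket in one sentence.",
    "response": "The customer cannot log in after resetting a password.",
}

preference_example = {
    "prompt": "Summarise the ticket in one sentence.",
    "chosen": "The customer cannot log in after resetting a password.",
    "rejected": "Login problem.",
}

PREFERENCE_FIELDS = ("prompt", "chosen", "rejected")


def validate_preference(record: dict) -> list:
    """Return a list of problems found in one preference triple."""
    problems = []
    for field in PREFERENCE_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # A pair whose two answers are identical carries no preference signal.
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems
```

Running a check like this over the whole preference set before stage two is one cheap way to catch empty or degenerate pairs early.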
Outputs
- SFT checkpoint — The instruction-tuned model from stage one. It is both the starting point for stage two and a baseline to evaluate against.
- DPO-aligned checkpoint — The final aligned model from stage two, ready for production evaluation.
- Stage-comparison eval report — Side-by-side scores on a held-out test, comparing the base, the stage-one model, and the stage-two model, so you can credit each gain to the right stage.
Steps (5)
Specify the two stages in one config
Write a single YAML that defines both stages, each with its datasets, hyperparameters, and outputs. One training entry point reads the config and routes to stage one or stage two.
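A minimal sketch of that routing, written in Python with the config held as a dictionary in place of a parsed YAML file. Every key, path, and hyperparameter value here is an illustrative assumption, and `run_sft` and `run_dpo` are placeholders that only return a description of what a real trainer would do.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for the parsed YAML file; all keys and values are hypothetical.
CONFIG = {
    "stage": "sft",  # switch to "dpo" to run stage two
    "stages": {
        "sft": {
            "dataset_path": "data/instructions.jsonl",
            "init_from": "base-model",
            "output_dir": "outputs/sft",
            "learning_rate": 2e-5,
            "epochs": 3,
        },
        "dpo": {
            "dataset_path": "data/preferences.jsonl",
            "init_from": "outputs/sft",  # stage two starts from stage one
            "output_dir": "outputs/dpo",
            "learning_rate": 5e-6,
            "epochs": 1,
            "beta": 0.1,
        },
    },
}


@dataclass(frozen=True)
class StageConfig:
    name: str
    dataset_path: str
    init_from: str
    output_dir: str
    learning_rate: float
    epochs: int
    beta: Optional[float] = None  # only used by the DPO stage


def load_stage(config: dict) -> StageConfig:
    """Pick the active stage out of the shared config."""
    name = config["stage"]
    if name not in config["stages"]:
        raise ValueError(f"unknown stage: {name!r}")
    return StageConfig(name=name, **config["stages"][name])


def run_sft(stage: StageConfig) -> str:
    # Placeholder: a real implementation would launch supervised fine-tuning.
    return f"SFT on {stage.dataset_path} starting from {stage.init_from}"


def run_dpo(stage: StageConfig) -> str:
    # Placeholder: a real implementation would launch DPO training.
    return f"DPO on {stage.dataset_path} starting from {stage.init_from}"


RUNNERS = {"sft": run_sft, "dpo": run_dpo}


def main(config: dict) -> str:
    """Single entry point: read the config and route to the chosen stage."""
    stage = load_stage(config)
    return RUNNERS[stage.name](stage)
```

Calling `main(CONFIG)` routes to the SFT placeholder; changing only the `stage` value routes to DPO, which is the property that keeps the two stages comparable.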
Run SFT on the instruction dataset
Stage one: fine-tune the base on (instruction, response) pairs to install the task and instruction-following. Track the training loss, save a checkpoint each epoch, and select the best checkpoint by its score on a validation set.
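Checkpoint selection can be sketched in a few lines of Python. This assumes the validation metric is a loss, so lower is better; the checkpoint names and loss values in the usage note are made up for illustration, and `pick_best_checkpoint` is a hypothetical helper.

```python
def pick_best_checkpoint(val_loss_by_checkpoint: dict) -> str:
    """Return the checkpoint name with the lowest validation loss."""
    if not val_loss_by_checkpoint:
        raise ValueError("no checkpoints to choose from")
    # min over the keys, ranked by each key's recorded validation loss
    return min(val_loss_by_checkpoint, key=val_loss_by_checkpoint.get)
```

For example, with the made-up readings `{"epoch-1": 1.42, "epoch-2": 1.18, "epoch-3": 1.25}` the function returns `"epoch-2"`: the loss rose again in the third epoch, so the last checkpoint is not the best one. If the validation metric is an accuracy-style score instead, swap `min` for `max`.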
Evaluate the SFT checkpoint independently
Before you move on, confirm stage one actually taught the task. If the stage-one model cannot do the task, stage two will not save it. Go back and change the dataset or the hyperparameters.
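One way to make this check explicit is a small gate between the stages. The sketch below assumes a single task score where higher is better, measured for both the base and the stage-one model on the same test; the `min_gain` threshold is an arbitrary illustrative value, and `stage_one_gate` is a hypothetical helper.

```python
def stage_one_gate(base_score: float, sft_score: float, min_gain: float = 0.05) -> str:
    """Decide whether the SFT checkpoint is good enough to start stage two."""
    gain = sft_score - base_score
    if gain >= min_gain:
        return "proceed to DPO"
    # Stage two cannot install a missing skill, so loop back instead.
    return "revisit SFT data or hyperparameters"
```

The threshold belongs in the shared config in practice, so that the decision to proceed is recorded alongside the run that produced it.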
Run DPO from the SFT checkpoint
Stage two: use the stage-one model as both the starting point and the frozen reference, and train on preference pairs to refine style and safety. Keep the anchor to the reference (in DPO, its strength is set by the beta coefficient) strong enough that the model does not lose the skill.
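To make the role of the reference concrete, here is a minimal pure-Python sketch of the standard DPO loss for a single preference pair. It takes summed log-probabilities of the chosen and rejected answers under the model being trained and under the reference; the function name, argument names, and the default `beta` of 0.1 are assumptions for illustration, and a real trainer would compute this over batches of tensors.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple."""
    # How far the policy has moved from the reference on each answer.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

At the start of stage two the policy equals the reference, so both log-ratios are zero and the loss is log 2 for every pair. The loss falls only as the policy raises the chosen answer relative to the rejected one, measured against the reference rather than in absolute terms, which is why the stage-one checkpoint has to be a good model before it is used as the anchor.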
Compare base, SFT, and DPO on a held-out eval
Run all three models on the same test set. Credit the skill gains to stage one and the style and preference gains to stage two. If the stage-two model does not show a measurable win-rate gain over the stage-one model, look into the preference data.
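The win-rate part of that comparison can be sketched as follows. The sketch assumes each test prompt has already been judged as a win, loss, or tie for the stage-two model against the stage-one model, and it counts a tie as half a win, which is one common convention rather than something the workflow prescribes. `win_rate` is a hypothetical helper.

```python
SCORES = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def win_rate(outcomes) -> float:
    """Fraction of pairwise comparisons won, counting ties as half a win."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes to score")
    unknown = [o for o in outcomes if o not in SCORES]
    if unknown:
        raise ValueError(f"unknown outcome labels: {unknown}")
    return sum(SCORES[o] for o in outcomes) / len(outcomes)
```

A win-rate near 0.5 against the stage-one model means stage two changed little that the judge can see, which is the signal to inspect the preference pairs.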
Principles
- Order matters. Stage one teaches the skill, stage two shapes preference. Never run them the other way round.
- Each stage owns its own data and its own metric. A skill test for stage one, a preference win-rate for stage two.
- The shared YAML config is the contract. It makes the workflow repeatable and the stages comparable.
- Evaluate each stage on its own before chaining them. A bad stage-one model guarantees a bad stage-two result.
Known failure modes (2)
Related patterns (1)
Related methodologies (3)
- Human-Feedback Alignment With DPO★
Shape a model toward human preferences with one supervised-style training step on chosen and rejected answers, skipping the operational weight of the older reinforcement-learning approach.
- Pretrain Then Adapt★★
Pay the cost of learning general language once, then spread it across many tasks by training one base and adapting it cheaply for each.
- Instruction Fine-tune Then Judge Cycle★★
Iterate on instruction fine-tunes using one signal, a model-graded score on the test set, while keeping training fit and answer quality as separate readings.
Sources (2)
LLM Engineer's Handbook — Paul Iusztin & Maxime Labonne (Packt, 2024, ISBN 9781836200079)
Ch 5 'Supervised Fine-Tuning' (creating an instruction dataset, exploring SFT and its techniques, chat templates, parameter-efficient approaches); Ch 6 'Fine-Tuning with Preference Alignment' (preference datasets, RLHF, Direct Preference Optimization) “Chapter 5: Supervised Fine-Tuning ... Chapter 6: Fine-Tuning with Preference Alignment”
PacktPublishing/LLM-Engineers-Handbook — official companion repo (YAML-configured SFT and DPO pipelines)
“SFT fine-tuning Llama 3.1 ... change finetuning_type to dpo ... configs/training.yaml”
Provenance
- Added to catalog:
- Last updated:
- Verification status: verified