SFT Then DPO Fine-tuning Workflow
Take an open-weight base model to a production-ready, well-behaved assistant in two clear stages, each with its own data and goal, both sharing one training pipeline.
Description
A two-stage way to fine-tune an open-weight model. Stage one teaches the model the task and how to follow instructions, by training it on instruction data (supervised fine-tuning, or SFT). Stage two refines style, helpfulness, and safety, by training it on pairs of answers marked better and worse (Direct Preference Optimization, or DPO). One config file switches between the two stages, so the same setup runs both. The order matters. Stage one installs the skill. Stage two shapes how that skill comes out. Skip or swap the stages and you get a model that either cannot do the task or sounds wrong.
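To make the stage-two objective concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It is an illustration, not this pattern's own implementation. It assumes the reference model is the frozen SFT checkpoint, that the inputs are summed token log-probabilities of each answer, and that `beta=0.1` is only an illustrative value. The function name `dpo_pair_loss` is hypothetical.

```python
import math


def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not taken from the source
) -> float:
    """DPO loss for one (better, worse) answer pair.

    Each argument is the summed log-probability of an answer under either the
    policy being trained or the frozen reference (assumed here to be the SFT
    checkpoint).
    """
    # How much more the policy likes each answer than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

At the start of stage two the policy equals the reference, so both margins are zero and the loss is log 2 for every pair. The loss drops only as the policy raises the better answer relative to the worse one, measured against the SFT checkpoint. This is one way to see why the order matters: DPO adjusts a skill the reference already has.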
When to apply
Use this when you build or customise an open-weight assistant for a domain where both the task skill and a specific tone, safety profile, or interaction style matter. It assumes you control the training stack and have, or can build, both an instruction dataset and a preference dataset. Skip it when an API instruct model already meets your needs, because the extra pipeline is then not worth it. Two cases justify it even then: regulated deployments that force the weights onto your own servers, and research into the alignment trade-offs between the two stages.
What it involves
- Specify the two stages in one config
- Run SFT on the instruction dataset
- Evaluate the SFT checkpoint independently
- Run DPO from the SFT checkpoint
- Compare base, SFT, and DPO on a held-out eval
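The steps above can be sketched as a small stage plan. This is a hedged illustration using only the standard library. The names `StageConfig`, `build_plan`, and `compare_checkpoints`, the field names, and the example paths are all hypothetical, and the actual training and evaluation calls are left to whatever stack you use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class StageConfig:
    """One stage of the shared pipeline; `stage` is the switch."""

    stage: str       # "sft" or "dpo"
    init_from: str   # checkpoint the stage starts from
    dataset: str     # instruction data for SFT, preference pairs for DPO
    output_dir: str  # where the stage writes its checkpoint

    def __post_init__(self) -> None:
        if self.stage not in ("sft", "dpo"):
            raise ValueError(f"unknown stage: {self.stage!r}")


def build_plan(
    base_model: str, sft_data: str, pref_data: str, out_root: str = "runs"
) -> List[StageConfig]:
    """Return the two stages in order, with DPO starting from the SFT output."""
    sft = StageConfig("sft", base_model, sft_data, f"{out_root}/sft")
    dpo = StageConfig("dpo", sft.output_dir, pref_data, f"{out_root}/dpo")
    return [sft, dpo]


def compare_checkpoints(
    checkpoints: List[str], score_fn: Callable[[str], float]
) -> Dict[str, float]:
    """Score base, SFT, and DPO checkpoints with the same held-out eval."""
    return {ckpt: score_fn(ckpt) for ckpt in checkpoints}
```

A caller would build the plan once, run each stage with its own trainer, evaluate the SFT checkpoint before starting DPO, and then pass the base model plus both stage outputs to `compare_checkpoints` with a single held-out scoring function. Deriving the DPO starting point from the SFT output inside `build_plan` makes it hard to swap or skip a stage by accident.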