Methodology · Fine-Tuning

SFT Then DPO Fine-tuning Workflow

Take an open-weight base model to a production-ready, well-behaved assistant in two stages. Each stage has its own data and goal, and both run on one training pipeline.

Description

A two-stage way to fine-tune an open-weight model. Stage one, supervised fine-tuning (SFT), trains the model on instruction data so that it learns the task and how to follow instructions. Stage two, Direct Preference Optimization (DPO), trains it on pairs of answers labelled better and worse to refine style, helpfulness, and safety. One config file switches between the two stages, so the same setup runs both. The order matters: stage one installs the skill, and stage two shapes how that skill comes out. Skip SFT or run DPO first and the model cannot do the task; skip DPO and it can do the task but sounds wrong.
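To make the stage-two objective concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It assumes you already have the summed token log-probabilities of the better ("chosen") and worse ("rejected") answer under both the model being trained and a frozen reference model, which in this workflow would typically be the SFT checkpoint. The function name and the default `beta=0.1` are illustrative assumptions, not values taken from this pattern.

```python
import math


def dpo_pair_loss(policy_chosen_logp: float,
                  policy_rejected_logp: float,
                  ref_chosen_logp: float,
                  ref_rejected_logp: float,
                  beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each answer.
    beta (assumed default 0.1) controls how far the trained model
    may drift from the reference model.
    """
    # How much more the trained model likes each answer than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(logits)), computed in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the trained model is identical to the reference, the loss is log 2 for every pair. It falls below that as the model raises the chosen answer relative to the rejected one, which is why DPO starts from a checkpoint that already has the task skill.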

When to apply

Use this when you build or customise an open-weight assistant for a domain where both the task skill and a specific tone, safety profile, or interaction style matter. It assumes you control the training stack and have, or can build, both an instruction dataset and a preference dataset. Skip it when an API instruct model already meets your needs, because then the extra pipeline is not worth it. Two exceptions justify the pipeline even then: regulated deployments that force the weights onto your own servers, and research into the alignment trade-offs between the two stages.

What it involves

Specify the two stages in one config
Run SFT on the instruction dataset
Evaluate the SFT checkpoint independently
Run DPO from the SFT checkpoint
Compare base, SFT, and DPO on a held-out eval
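The steps above can be sketched as a single config object that resolves each stage, using only the Python standard library. This is an illustration under stated assumptions, not the pattern's actual config format: the class names, field names, directory layout, and the placeholder model and dataset paths are all hypothetical. The point it shows is the chaining: SFT starts from the base model, DPO starts from the SFT output, and the final eval covers all three checkpoints.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class StageConfig:
    """Resolved settings for one training stage (hypothetical schema)."""
    name: str                      # "sft" or "dpo"
    dataset: str                   # path to the stage's dataset
    init_from: str                 # checkpoint the stage starts from
    output_dir: str                # where the stage writes its checkpoint
    beta: Optional[float] = None   # DPO only; unused for SFT


@dataclass(frozen=True)
class PipelineConfig:
    """One config describing both stages (hypothetical schema)."""
    base_model: str
    sft_dataset: str
    dpo_dataset: str
    run_dir: str
    dpo_beta: float = 0.1          # assumed default, tune for your data

    def stage(self, name: str) -> StageConfig:
        sft_out = f"{self.run_dir}/sft"
        if name == "sft":
            # Stage one starts from the open-weight base model.
            return StageConfig("sft", self.sft_dataset, self.base_model, sft_out)
        if name == "dpo":
            # Stage two starts from the SFT checkpoint, never from the base.
            return StageConfig("dpo", self.dpo_dataset, sft_out,
                               f"{self.run_dir}/dpo", beta=self.dpo_beta)
        raise ValueError(f"unknown stage: {name!r}")


def eval_targets(cfg: PipelineConfig) -> Dict[str, str]:
    """Checkpoints to score on the same held-out eval set."""
    return {
        "base": cfg.base_model,
        "sft": cfg.stage("sft").output_dir,
        "dpo": cfg.stage("dpo").output_dir,
    }


# Placeholder names for illustration only.
cfg = PipelineConfig(
    base_model="example-org/base-model",
    sft_dataset="data/instructions.jsonl",
    dpo_dataset="data/preferences.jsonl",
    run_dir="runs/assistant-v1",
)
```

Evaluating the SFT checkpoint on its own before starting DPO, and then scoring all three entries of `eval_targets(cfg)` on the same held-out set, makes it possible to attribute a gain or a regression to a specific stage.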

