Methodology · Fine-Tuning · emerging · verified

SFT Then DPO Fine-tuning Workflow

also known as two-stage alignment, instruct-then-align

Applies to: llm, llm-app

Tags: sft, dpo, two-stage, alignment-workflow

A two-stage way to fine-tune an open-weight model. Stage one teaches the model the task and how to follow instructions, by training it on instruction data (supervised fine-tuning, or SFT). Stage two refines style, helpfulness, and safety, by training it on pairs of answers marked better and worse (Direct Preference Optimization, or DPO). One config file switches between the two stages, so the same setup runs both. The order matters. Stage one installs the skill. Stage two shapes how that skill comes out. Skip or swap the stages and you get a model that either cannot do the task or sounds wrong.
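To make the stage-two objective concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written with plain floats. It assumes you already have the summed log-probabilities of each answer under the model being trained and under the frozen reference (the stage-one model). The function names and the default `beta=0.1` are illustrative assumptions, not values taken from this methodology.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; larger beta ties the model closer to the reference
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple.

    loss = -log(sigmoid(beta * (chosen_log_ratio - rejected_log_ratio)))
    where each log-ratio is log p_policy(answer) - log p_reference(answer).
    """
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(m)) is the same as softplus(-m).
    return softplus(-margin)
```

At the start of stage two the trained model equals the reference, so both log-ratios are zero and the loss is log 2 for every pair. The loss only falls when the model moves probability toward the chosen answer relative to where the stage-one model already was, which is why a reference that cannot do the task gives DPO nothing useful to refine.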

Methodology process overview

flowchart LR
    yaml[YAML config:\nstage, dataset, hparams] --> entry[Training entry point]
    base[Open-weight base] --> stage1
    inst[Instruction dataset] --> stage1[Stage 1: SFT]
    entry --> stage1
    stage1 --> sft[SFT checkpoint]
    sft --> eval1{Capability eval pass?}
    eval1 -->|no| fixdata[Fix instruct data / hparams]
    fixdata --> stage1
    eval1 -->|yes| stage2[Stage 2: DPO]
    pref[Preference dataset] --> stage2
    sft -->|reference + start| stage2
    entry --> stage2
    stage2 --> aligned[DPO-aligned checkpoint]
    aligned --> compare[Compare base vs SFT vs DPO]
    sft --> compare
    base --> compare
    compare --> ship[Production candidate]

Intent. Take an open-weight base to a production-ready, well-behaved assistant in two clear stages, each with its own data and goal, sharing one training pipeline.

When to apply. Use this when you build or customise an open-weight assistant for a domain where both the task skill and a specific tone, safety profile, or interaction style matter. You control the training stack and have, or can build, both an instruction dataset and a preference dataset. Skip it when an API instruct model already meets your needs, because then the extra pipeline is not worth it. Exceptions: regulated deployments that force the weights onto your own servers, or research into the alignment trade-offs between the two stages.

Example scenario

A team is building a customised open-weight assistant for a regulated industry. They follow the two-stage SFT then DPO workflow from the LLM Engineer's Handbook. They write a single YAML config with two sections, one per stage, each pointing at its own dataset, hyperparameters, and output directory. One Python entry point reads the YAML, routes to the stage-one or stage-two trainer, and writes checkpoints into a stage-tagged directory.

Stage one fine-tunes the base on an instruction dataset of 12,000 (instruction, response) pairs drawn from internal documentation. The stage-one checkpoint is evaluated on its own with a skill test, covering task accuracy and an instruction-following score, and is handed to stage two only when it clearly beats the base.

Stage two runs DPO from the stage-one checkpoint, using it both as the starting point and as the frozen reference that the DPO objective anchors to. The preference dataset of 6,000 (prompt, chosen, rejected) triples is labelled by a small team of domain experts, plus rewrites from a stronger model.

The team compares the base, the stage-one model, and the stage-two model on the same held-out test. Stage one delivers most of the skill gain. Stage two adds win-rate and tone consistency. In an early experiment they skipped stage one and applied DPO straight to the base. The model could not do the task, and DPO 'aligned' it to choose between equally bad answers. That confirmed the ordering rule, and they wrote it into the team runbook as a never-do-again experiment.
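The config-plus-router arrangement in this scenario could look roughly like the sketch below. A plain dict stands in for the parsed YAML so the example needs no third-party parser. Every key name, path, and hyperparameter value is hypothetical, and `plan_run` only resolves what a trainer would need; it does not train anything.

```python
from pathlib import PurePosixPath

# Stand-in for the parsed YAML file. All keys, paths, and values are hypothetical.
CONFIG = {
    "base_model": "path/to/open-weight-base",
    "stages": {
        "sft": {
            "dataset": "data/instructions.jsonl",
            "hparams": {"learning_rate": 2e-5, "epochs": 3},
            "output_root": "checkpoints",
        },
        "dpo": {
            "dataset": "data/preferences.jsonl",
            "hparams": {"learning_rate": 5e-7, "beta": 0.1, "epochs": 1},
            "output_root": "checkpoints",
        },
    },
}


def plan_run(config: dict, stage: str) -> dict:
    """Resolve one stage of the shared config into a concrete run plan."""
    stages = config["stages"]
    if stage not in stages:
        raise ValueError(f"unknown stage: {stage!r}")
    section = stages[stage]
    # Stage-tagged directory where stage one writes its checkpoint.
    sft_checkpoint = str(PurePosixPath(stages["sft"]["output_root"]) / "sft")
    is_sft = stage == "sft"
    return {
        "stage": stage,
        # Stage two starts from the stage-one checkpoint, never from the base.
        "start_from": config["base_model"] if is_sft else sft_checkpoint,
        # DPO also needs the stage-one checkpoint as its frozen reference.
        "reference": None if is_sft else sft_checkpoint,
        "dataset": section["dataset"],
        "hparams": dict(section["hparams"]),
        "output_dir": str(PurePosixPath(section["output_root"]) / stage),
    }
```

Calling `plan_run(CONFIG, "dpo")` yields a plan whose `start_from` and `reference` both point at `checkpoints/sft`. Encoding that in the router, rather than in the config, makes the ordering rule hard to break by editing a path.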

Inputs

Open-weight base model — A pretrained base checkpoint ready for instruction fine-tuning.
Instruction dataset — Curated (instruction, response) pairs. Stage one uses these to install the task and instruction-following.
Preference dataset — Triples of (prompt, chosen, rejected). Stage two uses these to align style and safety.
Pipeline configuration — A YAML config that picks the stage (stage one or stage two), the hyperparameters, the dataset paths, and the output directory.
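The two dataset shapes listed above can be sketched as simple record types. The class names and the validation rules below are illustrative assumptions; the source only fixes the fields: (instruction, response) for stage one and (prompt, chosen, rejected) for stage two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionExample:
    """One stage-one training record."""
    instruction: str
    response: str


@dataclass(frozen=True)
class PreferenceExample:
    """One stage-two training record."""
    prompt: str
    chosen: str
    rejected: str


def validate_preference(example: PreferenceExample) -> list[str]:
    """Return a list of problems with a preference triple; empty means usable."""
    problems = []
    for field_name in ("prompt", "chosen", "rejected"):
        if not getattr(example, field_name).strip():
            problems.append(f"empty {field_name}")
    # A pair with identical answers carries no preference signal.
    if example.chosen.strip() == example.rejected.strip():
        problems.append("chosen and rejected are identical")
    return problems
```

Running a check like this before stage two is cheap, and it catches the pairs that would otherwise contribute nothing to the preference signal.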

Outputs

SFT checkpoint — The instruction-tuned model from stage one. It is both the starting point for stage two and a baseline to evaluate against.
DPO-aligned checkpoint — The final aligned model from stage two, ready for production evaluation.
Stage-comparison eval report — Side-by-side scores on a held-out test, comparing the base, the stage-one model, and the stage-two model, so you can credit each gain to the right stage.
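One way to build the stage-comparison report is to difference the scores stage by stage, so each gain is credited to the stage that produced it. This is a minimal sketch: the function name, the metric names in the usage note, and the assumption that all three models were scored on the same metrics are illustrative.

```python
def stage_report(base: dict, sft: dict, dpo: dict) -> dict:
    """Build a per-metric comparison of base, stage-one, and stage-two scores.

    All three dicts map metric name -> score on the same held-out test.
    """
    if not (base.keys() == sft.keys() == dpo.keys()):
        raise ValueError("all three models must be scored on the same metrics")
    report = {}
    for metric in base:
        report[metric] = {
            "base": base[metric],
            "sft": sft[metric],
            "dpo": dpo[metric],
            # Gain credited to stage one: stage-one model versus base.
            "sft_gain": sft[metric] - base[metric],
            # Gain credited to stage two: stage-two model versus stage-one model.
            "dpo_gain": dpo[metric] - sft[metric],
        }
    return report
```

For example, `stage_report({"task_accuracy": 0.25}, {"task_accuracy": 0.75}, {"task_accuracy": 0.75})` credits the whole 0.5 gain to stage one and none to stage two. For a skill metric, that is the pattern the workflow expects.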

Steps (5)

Specify the two stages in one config
Write a single YAML that defines both stages, each with its datasets, hyperparameters, and outputs. One training entry point reads the config and routes to stage one or stage two.
Run SFT on the instruct dataset
Stage one: fine-tune the base on (instruction, response) pairs to install the task and instruction-following. Track the loss, save a checkpoint each epoch, and pick the best one by validation.
Evaluate the SFT checkpoint independently
Before you move on, confirm stage one actually taught the task. If the stage-one model cannot do the task, stage two will not save it. Go back and change the dataset or the hyperparameters.
Run DPO from the SFT checkpoint
Stage two: use the stage-one model as both the starting point and the reference, and train on preference pairs to refine style and safety. Keep the anchor to the reference strong enough that the model does not lose the skill.
Compare base, SFT, and DPO on a held-out eval
Run all three models on the same test set. Credit the skill gains to stage one and the style and preference gains to stage two. If stage two is not adding a measurable win-rate, look into the preference data.
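The gate between step three and step four can be sketched as a small decision function. The `min_gain` threshold and the returned action labels are hypothetical. The source only says the stage-one model must clearly beat the base, so what counts as "clearly" is left for the team to choose.

```python
def passes_capability_gate(
    base_scores: dict, sft_scores: dict, min_gain: float = 0.05
) -> bool:
    """True only if the stage-one model beats the base by min_gain on every metric.

    min_gain is an assumed threshold; pick one that matches your eval noise.
    """
    return all(
        sft_scores[metric] - base_scores[metric] >= min_gain
        for metric in base_scores
    )


def next_step(base_scores: dict, sft_scores: dict, min_gain: float = 0.05) -> str:
    """Decide whether to move on to DPO or loop back to stage one."""
    if passes_capability_gate(base_scores, sft_scores, min_gain):
        return "run_dpo"
    # Stage two cannot rescue a model that has not learned the task.
    return "fix_sft_data_or_hparams"
```

Requiring the margin on every metric, rather than on an average, is a deliberately strict choice. It stops a large gain in task accuracy from hiding a stage-one model that still fails to follow instructions.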

Principles

Order matters. Stage one teaches the skill, stage two shapes preference. Never run them the other way round.
Each stage owns its own data and its own metric. A skill test for stage one, a preference win-rate for stage two.
The shared YAML config is the contract. It makes the workflow repeatable and the stages comparable.
Evaluate each stage on its own before chaining them. A bad stage-one model guarantees a bad stage-two result.
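The stage-two metric named above, a preference win-rate, can be computed from pairwise judgments of the stage-two model against the stage-one model on the same prompts. This is a minimal sketch. The label names and the convention of counting a tie as half a win are assumptions, not something the source specifies.

```python
def win_rate(judgments: list[str]) -> float:
    """Win-rate of a candidate model against a baseline.

    judgments holds one label per prompt: "win", "loss", or "tie",
    from the candidate's point of view. Ties count as half a win
    (an assumed convention).
    """
    if not judgments:
        raise ValueError("no judgments to score")
    unknown = set(judgments) - {"win", "loss", "tie"}
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    wins = judgments.count("win")
    ties = judgments.count("tie")
    return (wins + 0.5 * ties) / len(judgments)
```

Under this convention a win-rate of 0.5 means stage two is indistinguishable from stage one on preference, which is the signal to go back and inspect the preference data.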

Related patterns (1)

★★ Augmented LLM
Build the foundational agent block as an LLM augmented with retrieval, tools, and memory that the model actively chooses to use, rather than a bare-model call.

Provenance

Added to catalog: 2026-05-24
Last updated: 2026-05-27
Verification status: verified
