Methodology · Fine-Tuning · emerging · verified

SFT Then DPO Fine-tuning Workflow

also known as two-stage alignment, instruct-then-align

Applies to: llm, llm-app

Tags: sft, dpo, two-stage, alignment-workflow

A two-stage way to fine-tune an open-weight model. Stage one teaches the model the task and how to follow instructions, by training it on instruction data (supervised fine-tuning, or SFT). Stage two refines style, helpfulness, and safety, by training it on pairs of answers marked better and worse (Direct Preference Optimization, or DPO). One config file switches between the two stages, so the same setup runs both. The order matters. Stage one installs the skill. Stage two shapes how that skill comes out. Skip or swap the stages and you get a model that either cannot do the task or sounds wrong.

Methodology process overview

Intent. Take an open-weight base to a production-ready, well-behaved assistant in two clear stages, each with its own data and goal, sharing one training pipeline.

When to apply. Use this when you build or customise an open-weight assistant for a domain where both the task skill and a specific tone, safety profile, or interaction style matter, and when you control the training stack and have, or can build, both an instruction dataset and a preference dataset. Skip it when an API instruct model already meets your needs, because then the extra pipeline is not worth it. Two cases justify the pipeline even then: regulated deployments that force the weights onto your own servers, and research into the alignment trade-offs between the two stages.

Inputs

  • Open-weight base model: A pretrained base checkpoint ready for instruction fine-tuning.
  • Instruction dataset: Curated (instruction, response) pairs. Stage one uses these to install the task and instruction-following.
  • Preference dataset: Triples of (prompt, chosen, rejected). Stage two uses these to align style and safety.
  • Pipeline configuration: A YAML config that picks the stage (stage one or stage two), the hyperparameters, the dataset paths, and the output directory.
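As a rough illustration of the pipeline configuration, the sketch below shows one config selecting between the two stages through a single entry point. It is a sketch under assumptions, not a prescribed layout: the `PIPELINE` dict stands in for the parsed YAML file, and every key, path, learning rate, and function name (`StageConfig`, `run_sft`, `run_dpo`, `dispatch`) is hypothetical. The runners only describe what they would do; real training code would replace their bodies.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class StageConfig:
    dataset_path: str     # instruction pairs for SFT, preference triples for DPO
    output_dir: str       # where the stage writes its checkpoints
    learning_rate: float  # placeholder value, not a recommendation
    init_from: str        # checkpoint the stage starts from


# Stands in for the parsed YAML config. All keys and values are hypothetical.
PIPELINE = {
    "stage": "sft",
    "stages": {
        "sft": {
            "dataset_path": "data/instructions.jsonl",
            "output_dir": "out/sft",
            "learning_rate": 2e-5,
            "init_from": "base-model",
        },
        "dpo": {
            "dataset_path": "data/preferences.jsonl",
            "output_dir": "out/dpo",
            "learning_rate": 5e-7,
            "init_from": "out/sft",  # stage two starts from the stage-one output
        },
    },
}


def run_sft(cfg: StageConfig) -> str:
    # Placeholder for supervised fine-tuning on (instruction, response) pairs.
    return f"SFT from {cfg.init_from} -> {cfg.output_dir}"


def run_dpo(cfg: StageConfig) -> str:
    # Placeholder for DPO on (prompt, chosen, rejected) triples.
    return f"DPO from {cfg.init_from} -> {cfg.output_dir}"


RUNNERS: Dict[str, Callable[[StageConfig], str]] = {"sft": run_sft, "dpo": run_dpo}


def dispatch(pipeline: dict) -> str:
    """Read the selected stage from the config and route to its runner."""
    stage = pipeline["stage"]
    if stage not in RUNNERS:
        raise ValueError(f"unknown stage: {stage!r}")
    cfg = StageConfig(**pipeline["stages"][stage])
    return RUNNERS[stage](cfg)
```

Switching stages is then a one-field change: `dispatch({**PIPELINE, "stage": "dpo"})` routes the same config to the second runner.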

Outputs

  • SFT checkpoint: The instruction-tuned model from stage one. It is both the starting point for stage two and a baseline to evaluate against.
  • DPO-aligned checkpoint: The final aligned model from stage two, ready for production evaluation.
  • Stage-comparison eval report: Side-by-side scores on a held-out test, comparing the base, the stage-one model, and the stage-two model, so you can credit each gain to the right stage.
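One possible shape for the stage-comparison report is sketched below. It assumes scores arrive as a nested dict of model name to metric name to value; that layout, the metric names, and the helper names `stage_deltas` and `format_report` are all hypothetical.

```python
from typing import Dict, List

MODELS = ["base", "sft", "dpo"]  # order in which the stages were produced

Scores = Dict[str, Dict[str, float]]  # model name -> metric name -> value


def stage_deltas(scores: Scores, metric: str) -> Dict[str, float]:
    """Gain of each stage over the model it started from, for one metric."""
    return {
        "sft_vs_base": scores["sft"][metric] - scores["base"][metric],
        "dpo_vs_sft": scores["dpo"][metric] - scores["sft"][metric],
    }


def format_report(scores: Scores, metrics: List[str]) -> str:
    """Render a plain-text table with one row per metric, one column per model."""
    header = f"{'metric':<16}" + "".join(f"{m:>8}" for m in MODELS)
    rows = [header]
    for metric in metrics:
        cells = "".join(f"{scores[m][metric]:>8.2f}" for m in MODELS)
        rows.append(f"{metric:<16}" + cells)
    return "\n".join(rows)
```

Reading the two deltas separately is what lets a gain be credited to the stage that produced it: a task metric should move mostly in `sft_vs_base`, a preference metric mostly in `dpo_vs_sft`.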

Steps (5)

  1. Specify the two stages in one config

    Write a single YAML that defines both stages, each with its datasets, hyperparameters, and outputs. One training entry point reads the config and routes to stage one or stage two.

  2. Run SFT on the instruct dataset

    Stage one: fine-tune the base on (instruction, response) pairs to install the task and instruction-following. Track the loss, save a checkpoint each epoch, and pick the best one by validation.

  3. Evaluate the SFT checkpoint independently

    Before you move on, confirm stage one actually taught the task. If the stage-one model cannot do the task, stage two will not save it. Go back and change the dataset or the hyperparameters.

  4. Run DPO from the SFT checkpoint

    Stage two: use the stage-one model as both the starting point and the frozen reference, and train on preference pairs to refine style and safety. Keep the anchor to the reference (in DPO, the β coefficient that penalises drift from the reference model) strong enough that the model does not lose the skill.

  5. Compare base, SFT, and DPO on a held-out eval

    Run all three models on the same test set. Credit the skill gains to stage one and the style and preference gains to stage two. If stage two does not add a measurable win-rate gain over the stage-one model, look into the preference data.
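To make step 4 concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It assumes you already have the summed log-probabilities of the chosen and rejected responses under both the policy being trained and the frozen stage-one reference. The function name, argument names, and the default `beta=0.1` are illustrative choices, not values taken from this methodology.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default; larger beta anchors harder to the reference
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    where each term is a sequence log-probability.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

At the very start of stage two the policy is a copy of the reference, so both log-ratios are zero and the loss equals log 2 for every pair. The loss falls only as the policy raises the chosen response relative to the rejected one, measured against the stage-one model rather than in absolute terms.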

Principles

  • Order matters. Stage one teaches the skill, stage two shapes preference. Never run them the other way round.
  • Each stage owns its own data and its own metric. A skill test for stage one, a preference win-rate for stage two.
  • The shared YAML config is the contract. It makes the workflow repeatable and the stages comparable.
  • Evaluate each stage on its own before chaining them. A bad stage-one model guarantees a bad stage-two result.
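The per-stage metrics and the evaluate-before-chaining rule can be sketched as two small helpers. Treat the details as assumptions: counting a tie as half a win is one common convention rather than something this methodology specifies, and the `min_gain` threshold and the names `win_rate` and `sft_gate` are hypothetical.

```python
from typing import Sequence


def win_rate(outcomes: Sequence[str]) -> float:
    """Share of pairwise comparisons the stage-two model wins against stage one.

    Each outcome is "win", "loss", or "tie" from the stage-two model's side.
    A tie counts as half a win (an assumed convention).
    """
    if not outcomes:
        raise ValueError("no comparisons to score")
    total = 0.0
    for outcome in outcomes:
        if outcome == "win":
            total += 1.0
        elif outcome == "tie":
            total += 0.5
        elif outcome != "loss":
            raise ValueError(f"unexpected outcome: {outcome!r}")
    return total / len(outcomes)


def sft_gate(base_score: float, sft_score: float, min_gain: float = 0.05) -> bool:
    """Allow stage two only if stage one beat the base on the skill test.

    min_gain is a placeholder threshold; choose it from your own eval noise.
    """
    return (sft_score - base_score) >= min_gain
```

For example, `win_rate(["win", "win", "tie", "loss"])` gives 0.625, and a stage-two win-rate that stays near 0.5 against the stage-one model means the preference stage is not adding a measurable gain.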

Provenance

  • Added to catalog:
  • Last updated:
  • Verification status: verified