Instruction Fine-tune Then Judge Cycle
also known as SFT-with-LLM-judge loop, instruction-tune evaluate iterate
A closed loop for teaching a model to follow instructions, then checking the result. First, split your instruction data into three parts: training, validation, and test. Then train on the training data (this step is supervised fine-tuning, or SFT), logging the loss after each epoch on both the training and validation sets. Next, run the model on the test set to generate answers. Finally, have a local model score those answers against the reference answers, acting as a judge. The loop catches two problems on every pass. If the validation loss pulls away from the training loss, the model is overfitting. If the judge score drops, quality is slipping. So you keep the gains and roll back the slips. This is the smallest end-to-end fine-tuning workflow that checks its own quality without a human grading every step.
Methodology process overview
Intent. Iterate on instruction fine-tunes using one signal, a model-graded score on the test set, while keeping training fit and answer quality as separate readings.
When to apply. Use this when you are teaching a small or mid-size open-weight model to follow instructions and need to iterate fast without paying a human to grade every cycle. It fits best when the dataset is small, in the hundreds to low thousands of examples, where overfitting is a real risk and a held-out test set is a must. Don't apply it when your judge model is weaker than the model it is grading, because a weak judge misses errors and its scores drift toward false positives. Don't use it as the only gate when the task is safety-critical; keep humans in the loop. Either way, a production launch still needs a human-graded acceptance set before shipping.
Inputs
- Instruction dataset — Pairs of instruction and response, split into training, validation, and test sets.
- Base or instruction-tuned checkpoint — The model you will fine-tune. Usually an open-weight base or an existing instruct version.
- Judge model — A model you can run locally that is at least as good as the model under test on the target task.
Outputs
- Fine-tuned model checkpoints — One checkpoint per epoch, saved during training, each tagged with its training and validation loss.
- Test-set responses — The answers the chosen checkpoint generated on the held-out test set.
- Aggregate judge score — The judge model's scores over the test set, both the average and per-example. This is the main number you iterate on.
Steps (5)
Curate and split the instruction dataset
Collect instruction-and-response pairs. Set a fixed split into training, validation, and test, such as 935 / 55 / 110. Version the test set so it never changes across runs.
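A minimal sketch of a fixed split, assuming the dataset is already loaded as a list of records. The function name `split_dataset`, the 85 / 10 / 5 percentages, the seed, and the record fields are illustrative assumptions. The percentages are chosen only because they reproduce the 935 / 55 / 110 example on 1,100 records.

```python
import random

def split_dataset(records, train_pct=85, test_pct=10, seed=123):
    """Shuffle once with a fixed seed, then cut into train / validation / test."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # private RNG: same seed, same order
    n_train = len(shuffled) * train_pct // 100
    n_test = len(shuffled) * test_pct // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]      # whatever remains is validation
    return train, val, test

# Placeholder records; the field names are assumptions for this sketch.
data = [{"instruction": f"task {i}", "output": f"answer {i}"} for i in range(1100)]
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))
```

Integer arithmetic keeps the cut points exact, and the private `random.Random(seed)` instance means other code that touches the global random state cannot change the split.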
Run SFT with per-epoch loss logging
Fine-tune with a standard supervised loss. Log the training loss and the validation loss at least once per epoch, and more often if you can, so you can see early if they start pulling apart. Save a checkpoint each epoch.
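One way to turn "pulling apart" into a check is sketched below. It assumes one logged loss value per epoch for each split. The function name, the `patience` threshold, and the loss values are made up for illustration and are not taken from any real training run.

```python
def divergence_epoch(train_losses, val_losses, patience=2):
    """Return the 1-based epoch where validation loss starts rising for
    `patience` consecutive epochs while training loss keeps falling.
    Return None if that never happens."""
    streak = 0
    for i in range(1, len(val_losses)):
        val_up = val_losses[i] > val_losses[i - 1]
        train_down = train_losses[i] < train_losses[i - 1]
        streak = streak + 1 if (val_up and train_down) else 0
        if streak >= patience:
            return i - patience + 2  # first epoch of the rising streak, 1-based
    return None

# Invented per-epoch losses, for illustration only.
train_losses = [2.0, 1.2, 0.8, 0.5, 0.3]
val_losses = [2.1, 1.4, 1.3, 1.5, 1.7]

diverged_at = divergence_epoch(train_losses, val_losses)
best_epoch = val_losses.index(min(val_losses)) + 1  # checkpoint with lowest validation loss
print(diverged_at, best_epoch)
```

In this invented run the check flags epoch 4, so the epoch 3 checkpoint, which has the lowest validation loss, is the natural candidate to carry forward.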
Generate responses on the test set
With the chosen checkpoint, generate an answer for every test-set instruction. Save each one as a triple of instruction, reference answer, and generated answer, ready for scoring.
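The triple format can be sketched as follows. The `generate` callable stands in for whatever inference call the chosen checkpoint exposes. It, the field names `instruction` and `output`, and the output file name are assumptions for this sketch.

```python
import json
import tempfile
from pathlib import Path

def build_triples(test_set, generate):
    """Pair every test instruction with its reference and the model's answer."""
    return [
        {
            "instruction": ex["instruction"],
            "reference": ex["output"],
            "generated": generate(ex["instruction"]),
        }
        for ex in test_set
    ]

def placeholder_generate(instruction):
    # Hypothetical stand-in for the real checkpoint's generation call.
    return f"[model answer to: {instruction}]"

test_set = [
    {"instruction": "Name the capital of France.", "output": "Paris."},
    {"instruction": "Convert 2 km to meters.", "output": "2000 meters."},
]
triples = build_triples(test_set, placeholder_generate)

# Persist the triples so scoring can run later without regenerating answers.
out_path = Path(tempfile.gettempdir()) / "test_responses.json"
out_path.write_text(json.dumps(triples, indent=2), encoding="utf-8")
reloaded = json.loads(out_path.read_text(encoding="utf-8"))
```

Saving the triples to disk separates generation from scoring, so a judge run can be repeated without calling the fine-tuned model again.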
Score with a local LLM-as-judge
Run the judge model over every triple using a fixed scoring prompt. Output a score per example plus an overall number, for example a mean of 51.75 over 110 examples.
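A rough sketch of the scoring pass, assuming the judge is reachable as a plain function from prompt text to reply text. The prompt wording, the 0 to 100 scale, the `judge` callable, and the canned replies are illustrative assumptions, not the exact prompt or output of any particular tool.

```python
import re
from statistics import mean

# Fixed scoring prompt; the wording here is an assumption for this sketch.
PROMPT_TEMPLATE = (
    "Given the input `{instruction}` and the correct output `{reference}`, "
    "score the model response `{generated}` on a scale from 0 to 100, "
    "where 100 is the best score. Respond with the integer number only."
)

def parse_score(reply):
    """Extract the first standalone integer in 0..100 from the reply, else None."""
    match = re.search(r"\b(\d{1,3})\b", reply)
    if not match:
        return None
    value = int(match.group(1))
    return value if 0 <= value <= 100 else None

def score_triples(triples, judge):
    """Score every triple; unparseable replies are counted but left out of the mean."""
    scores = [parse_score(judge(PROMPT_TEMPLATE.format(**t))) for t in triples]
    valid = [s for s in scores if s is not None]
    return {
        "per_example": scores,
        "mean": mean(valid) if valid else None,
        "scored": len(valid),
        "total": len(triples),
    }

# Canned replies stand in for a real local judge model.
canned = iter(["90", "Score: 60", "I cannot rate this."])
def fake_judge(prompt):
    return next(canned)

sample = [{"instruction": "q", "reference": "a", "generated": "b"}] * 3
report = score_triples(sample, fake_judge)
print(report)
```

Reporting `scored` next to `total` makes it visible when the judge fails to return a usable number, which would otherwise silently shift the mean.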
Diagnose and iterate
Read the loss curves to spot overfitting. Read the judge score to spot quality problems. Change exactly one thing per pass, such as the dataset, a hyperparameter, or the base model, then run the loop again.
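The keep-or-roll-back decision can be sketched as a small record per pass. The class, the field names, and every number below are invented for illustration. The sketch assumes the judge mean alone drives the decision, while the loss-curve reading is reported beside it as a separate diagnosis.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PassResult:
    label: str
    change: str              # the single thing changed in this pass
    judge_mean: float        # aggregate judge score on the frozen test set
    val_loss_diverged: bool  # reading taken from the loss curves

def review(best, candidate):
    """Return (decision, diagnosis); the two readings stay separate."""
    decision = "keep" if candidate.judge_mean > best.judge_mean else "roll back"
    diagnosis = "overfitting" if candidate.val_loss_diverged else "fit looks healthy"
    return decision, diagnosis

# Invented passes, each changing exactly one thing relative to the current best.
baseline = PassResult("run-1", "baseline", 50.0, False)
more_epochs = PassResult("run-2", "epochs 2 -> 4", 48.5, True)
cleaner_data = PassResult("run-3", "deduplicated training set", 53.0, False)

best = baseline
for candidate in (more_epochs, cleaner_data):
    decision, diagnosis = review(best, candidate)
    print(f"{candidate.label} ({candidate.change}): {decision}, {diagnosis}")
    if decision == "keep":
        best = candidate
```

Because each record names its one change, a score movement between two passes can be attributed to that change directly.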
Principles
- Fix the training, validation, and test split before the first run. Moving the test set makes score comparisons meaningless.
- Loss curves tell you about fitting. Judge scores tell you about quality. You need both.
- The judge model and the scoring prompt are frozen. Change either one and all past scores stop being comparable.
- Change one thing per pass, so any score change has a single clear cause.
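One way to enforce the frozen-evaluation principles is to fingerprint the evaluation setup and store the fingerprint with every score. The sketch below assumes the test set, judge model name, and scoring prompt are all available as plain Python values. The function name and the example values are hypothetical.

```python
import hashlib
import json

def eval_fingerprint(test_set, judge_model, scoring_prompt):
    """Hash everything that must stay frozen for scores to remain comparable."""
    payload = json.dumps(
        {
            "test_set": test_set,
            "judge_model": judge_model,
            "scoring_prompt": scoring_prompt,
        },
        sort_keys=True,       # stable key order gives a stable hash
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

test_set = [{"instruction": "Name the capital of France.", "output": "Paris."}]
prompt_v1 = "Score the response from 0 to 100. Respond with the integer only."
prompt_v2 = "Score the response from 1 to 5. Respond with the integer only."

fp_a = eval_fingerprint(test_set, "local-judge-v1", prompt_v1)
fp_b = eval_fingerprint(test_set, "local-judge-v1", prompt_v1)
fp_c = eval_fingerprint(test_set, "local-judge-v1", prompt_v2)
print(fp_a == fp_b, fp_a == fp_c)
```

Two scores are comparable only when their fingerprints match. A changed test set, judge, or prompt produces a different fingerprint and marks the start of a new score series.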
Known failure modes (2)
Related patterns (3)
- ★★LLM-as-Judge
Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
- ★Frozen Rubric Reflection
Constrain reflection to a fixed, hand-authored rubric of criteria so the reviewer cannot invent new ones each run.
- ★★Eval Harness
Run a held-out dataset against agent versions to detect regressions and measure improvement.
Related methodologies (2)
Sources (2)
Build a Large Language Model (From Scratch) — Sebastian Raschka
Ch 7 'Fine-tuning to follow instructions' “Training set length: 935, Validation set length: 55, Test set length: 110 ... Average score: 51.75”
rasbt/LLMs-from-scratch — ch07/01_main-chapter-code README (training run output)
“Training set length: 935 ... Validation set length: 55 ... Test set length: 110 ... Average score: 51.75 ... Number of scores: 110 of 110”
Provenance
- Added to catalog:
- Last updated:
- Verification status: verified