Methodology · Fine-Tuning · proven · verified

Instruction Fine-tune Then Judge Cycle

also known as SFT-with-LLM-judge loop, instruction-tune evaluate iterate

Applies to: llm, llm-app

Tags: sft, instruction-tuning, llm-as-judge, iteration-loop

A closed loop for teaching a model to follow instructions, then checking the result. First, split your instruction data into three parts: training, validation, and test. Then train on the training data (this step is supervised fine-tuning, or SFT), logging the loss after each epoch on both the training and validation sets. Next, run the model on the test set to generate answers. Finally, have a local model score those answers against the reference answers, acting as a judge. The loop catches two problems on every pass. If the validation loss pulls away from the training loss, the model is overfitting. If the judge score drops, quality is slipping. So you keep the gains and roll back the slips. This is the smallest end-to-end fine-tuning workflow that checks its own quality without a human grading every step.

Methodology process overview

flowchart TD
    ds[Instruction dataset] --> split[Split train/val/test\nversion test set]
    split --> sft[SFT with per-epoch loss logging]
    sft --> ck[Per-epoch checkpoints\n+ train/val loss curves]
    ck --> pick[Pick best checkpoint by val]
    pick --> gen[Generate responses on test set]
    gen --> trip[(instruction, reference, response)]
    trip --> judge[Local LLM-as-judge\nfrozen rubric prompt]
    judge --> score[Aggregate + per-example scores]
    ck --> diag{Loss curves diverge?}
    diag -->|yes| overfit[Diagnose overfitting]
    diag -->|no| score
    score --> iter[Change ONE variable]
    iter -->|new run| sft

Intent. Iterate on instruction fine-tunes using one signal, a model-graded score on the test set, while keeping training fit and answer quality as separate readings.

When to apply. Use this when you are teaching a small or mid-size open-weight model to follow instructions and you need to iterate fast without paying a human to grade every cycle. It fits best when the dataset is small, in the hundreds to low thousands of examples, where overfitting is a real risk and a held-out test set is a must. Don't apply it when your judge model is weaker than the model it is grading: a weak judge tends to pass answers it cannot tell are wrong, so the scores drift toward false positives. When the task is safety-critical, don't use it as the only gate; keep humans in the loop. One rule holds either way: a production launch still needs a human-graded acceptance set before shipping.

Example scenario

An engineer is teaching an open-weight 7B base model to follow instructions. The data is a custom set of 1,100 (instruction, response) pairs collected from internal documentation Q&A. They follow Raschka's Chapter 7 closed loop. The split is fixed at 935 train, 55 validation, 110 test, with the test set committed to git and never touched between runs. The training job logs training and validation loss at every step. In run one the curves pull apart cleanly after epoch 2, which exposes overfitting. The engineer picks the epoch-2 checkpoint, generates answers for all 110 test instructions, and scores each (instruction, reference, response) triple with a locally hosted Llama 3 70B judge model using a frozen scoring prompt. The average score is 51.75 out of 100. From there the engineer changes exactly one thing per pass. In run two they halve the learning rate, and the judge score rises to 56.2. In run three they dedupe the dataset, and the score rises to 58.9. In run four they restructure the prompt template, the score drops to 54.1, and they revert. Because the split, the judge model, and the scoring prompt are all frozen, every score is comparable to every score before it. The engineer ships the run-three checkpoint for an internal beta, but flags clearly that the judge score is a fast iteration signal, not a release gate. A human-graded acceptance set is still run before any production exposure.

Inputs

Instruction dataset — Pairs of instruction and response, split into training, validation, and test sets.
Base or instruction-tuned checkpoint — The model you will fine-tune. Usually an open-weight base or an existing instruct version.
Judge model — A model you can run locally that is at least as good as the model under test on the target task.

Outputs

Fine-tuned model checkpoints — One checkpoint per epoch, saved during training, each tagged with its training and validation loss.
Test-set responses — The answers the chosen checkpoint generated on the held-out test set.
Aggregate judge score — The judge model's scores over the test set, both the average and per-example. This is the main number you iterate on.

Steps (5)

Curate and split the instruction dataset
Collect instruction-and-response pairs. Set a fixed split into training, validation, and test, such as 935 / 55 / 110. Version the test set so it never changes across runs.
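A fixed split can be sketched as below. This is a minimal illustration, not a prescribed implementation: the function name `split_dataset`, the seed value, and the placeholder records are assumptions made for the example. Only the 935 / 55 / 110 sizes come from the text above.

```python
import random


def split_dataset(pairs, n_train, n_val, n_test, seed=123):
    """Shuffle once with a fixed seed, then cut into train/val/test."""
    if n_train + n_val + n_test != len(pairs):
        raise ValueError("split sizes must add up to the dataset size")
    shuffled = list(pairs)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # private RNG: same seed, same order
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Placeholder records standing in for 1,100 real (instruction, response) pairs.
pairs = [
    {"instruction": f"question {i}", "response": f"answer {i}"}
    for i in range(1100)
]
train, val, test = split_dataset(pairs, 935, 55, 110)
```

Writing the resulting test split to a file and committing it is one way to version it. Re-deriving the split on every run would also work as long as the seed and the source data never change.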
Run SFT with per-epoch loss logging
Fine-tune with a standard supervised loss. Log the training loss and the validation loss often, so you can see early if they start pulling apart. Save a checkpoint each epoch.
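Reading the logged curves can be partly automated. The sketch below assumes the losses are available as plain per-epoch lists. The helper names and the loss values are illustrative, not taken from a real run. It picks the checkpoint with the lowest validation loss and reports the first epoch where training loss keeps falling while validation loss rises, which is one simple heuristic for the curves "pulling apart".

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1


def divergence_epoch(train_losses, val_losses):
    """Return the first 1-based epoch where train loss falls but val loss rises.

    Returns None if the curves never diverge in that sense.
    """
    for i in range(1, len(val_losses)):
        train_improved = train_losses[i] < train_losses[i - 1]
        val_worsened = val_losses[i] > val_losses[i - 1]
        if train_improved and val_worsened:
            return i + 1
    return None


# Illustrative per-epoch losses (made up for this sketch).
train_losses = [1.80, 1.10, 0.70, 0.45]
val_losses = [1.85, 1.30, 1.38, 1.52]

chosen = best_epoch(val_losses)                         # epoch 2
diverges_at = divergence_epoch(train_losses, val_losses)  # epoch 3
```

With noisy per-step logs, a smoothed curve or a patience window would be a more robust choice than a single epoch-to-epoch comparison.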
Generate responses on the test set
With the chosen checkpoint, generate an answer for every test-set instruction. Save each one as a triple of instruction, reference answer, and generated answer, ready for scoring.
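The triples can be stored as JSON Lines so the scoring step can read them back unchanged. In this sketch `generate` is a hypothetical callable that wraps whatever inference code the chosen checkpoint uses. A stub stands in for it here, and the single test record is placeholder data.

```python
import json


def build_triples(test_set, generate):
    """Pair each test instruction with its reference and the model's answer."""
    triples = []
    for example in test_set:
        triples.append({
            "instruction": example["instruction"],
            "reference": example["response"],  # the dataset's gold answer
            "response": generate(example["instruction"]),
        })
    return triples


def dump_jsonl(triples):
    """Serialize triples as one JSON object per line."""
    return "\n".join(json.dumps(t, ensure_ascii=False) for t in triples)


def fake_generate(instruction):
    """Stub standing in for the fine-tuned model (assumption for this sketch)."""
    return f"(model answer to: {instruction})"


# Placeholder test record; a real run would load the versioned test split.
test_set = [
    {"instruction": "Where are the deploy docs?", "response": "In the internal wiki."}
]
triples = build_triples(test_set, fake_generate)
jsonl_text = dump_jsonl(triples)
```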
Score with a local LLM-as-judge
Run the judge model over every triple using a fixed scoring prompt. Output a score per example plus an overall number, for example a mean of 51.75 out of 100.
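A scoring pass might look like the sketch below. Several things here are assumptions: the wording of `RUBRIC_PROMPT`, the 0 to 100 integer scale, and the `judge` callable, which stands in for a call to the locally hosted judge model. A stub replaces it so the example runs on its own. Replies that contain no usable number are recorded as `None` and left out of the mean.

```python
import re
from statistics import mean

# Frozen scoring prompt (illustrative wording, not the source's actual prompt).
RUBRIC_PROMPT = (
    "Given the instruction '{instruction}' and the correct output '{reference}', "
    "score the model response '{response}' on a scale from 0 to 100, "
    "where 100 is the best score. Respond with the integer number only."
)


def parse_score(reply):
    """Extract the first integer in the reply; None if missing or out of range."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    value = int(match.group())
    return value if 0 <= value <= 100 else None


def score_triples(triples, judge):
    """Return (per-example scores, mean over the parsable scores)."""
    per_example = [
        parse_score(judge(RUBRIC_PROMPT.format(**t))) for t in triples
    ]
    valid = [s for s in per_example if s is not None]
    return per_example, (mean(valid) if valid else None)


# Stub judge returning canned replies (assumption for this sketch).
canned = iter(["80", "Score: 45", "n/a"])
triples = [
    {"instruction": f"q{i}", "reference": f"ref{i}", "response": f"ans{i}"}
    for i in range(3)
]
per_example, average = score_triples(triples, lambda prompt: next(canned))
```

Tracking how many replies failed to parse is worth doing alongside the mean, since a judge that often ignores the output format makes the aggregate less trustworthy.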
Diagnose and iterate
Read the loss curves to spot overfitting. Read the judge score to spot quality problems. Change exactly one thing per pass, such as the dataset, a hyperparameter, or the base model, then run the loop again.
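The keep-or-revert decision can be written down as a small rule: a run is kept only if its judge score beats the best score so far. The sketch below replays the four runs from the example scenario. The `decide` helper is a hypothetical name, and a real loop would probably also want a noise margin before treating a small gain as real.

```python
# (run name, the single change made, mean judge score) from the example scenario.
runs = [
    ("run 1", "baseline", 51.75),
    ("run 2", "halve learning rate", 56.2),
    ("run 3", "dedupe dataset", 58.9),
    ("run 4", "restructure prompt template", 54.1),
]


def decide(runs):
    """Keep a run if it beats the best score so far, otherwise revert it."""
    best = None
    decisions = []
    for name, change, score in runs:
        if best is None or score > best[2]:
            best = (name, change, score)
            decisions.append((name, "keep"))
        else:
            decisions.append((name, "revert"))
    return best, decisions


best, decisions = decide(runs)
for name, verdict in decisions:
    print(name, verdict)
```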


Principles

Fix the training, validation, and test split before the first run. Moving the test set makes score comparisons meaningless.
Loss curves tell you about fitting. Judge scores tell you about quality. You need both.
The judge model and the scoring prompt are frozen. Change either one and all past scores stop being comparable.
Change one thing per pass, so any score change has a single clear cause.
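One way to enforce the frozen-evaluation principles is to store a fingerprint of the test set, the judge model identifier, and the scoring prompt with every run, and to refuse to compare scores whose fingerprints differ. This is a sketch under assumptions: the helper names, the record layout, and the placeholder judge identifier are all hypothetical.

```python
import hashlib
import json


def eval_fingerprint(test_set, judge_model, rubric_prompt):
    """Hash everything that must stay frozen for scores to be comparable."""
    payload = json.dumps(
        {
            "test_set": test_set,
            "judge_model": judge_model,
            "rubric_prompt": rubric_prompt,
        },
        sort_keys=True,       # stable key order so the hash is reproducible
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def comparable(run_a, run_b):
    """Two runs are comparable only if their evaluation setup is identical."""
    return run_a["fingerprint"] == run_b["fingerprint"]


# Placeholder inputs for illustration.
test_set = [{"instruction": "q1", "response": "a1"}]
prompt_v1 = "Score the response from 0 to 100."
prompt_v2 = "Score the response from 1 to 5."

run_a = {"score": 56.2, "fingerprint": eval_fingerprint(test_set, "judge-model-id", prompt_v1)}
run_b = {"score": 58.9, "fingerprint": eval_fingerprint(test_set, "judge-model-id", prompt_v1)}
run_c = {"score": 4.1, "fingerprint": eval_fingerprint(test_set, "judge-model-id", prompt_v2)}
```

Here `comparable(run_a, run_b)` holds, while `run_c` was scored with a different prompt and should not be ranked against the other two.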
