Instruct Dataset Generation Pipeline
also known as seven-stage instruction dataset, noisy-docs to instruct
A seven-stage pipeline that turns a messy pile of documents into a clean dataset for teaching a model to follow instructions, ready for supervised fine-tuning. The stages are: pull out candidate examples, remove duplicates, remove anything that overlaps your test set, filter for quality, check coverage, add data where it is thin, then package it. The pipeline rests on a clear finding. For a task-specific fine-tune, 100 to 100,000 carefully prepared examples beat a much larger but dirty pile. Quality, not quantity, is the lever.
Methodology process overview
Intent. Turn a raw document corpus into a clean, leak-free, well-covered instruction-tuning dataset through seven clear stages.
When to apply. Use this when preparing the instruction-tuning dataset for a supervised fine-tune of an LLM, such as a domain model, an LLM twin, or a task-specific assistant. It helps most when the starting material is messy, such as scraped pages, exported chats, or mixed-quality archives. Do not use it when you already have a hand-curated instruction set of proven quality. Two exceptions: skip the add-data stage when the corpus is large and evenly spread, and skip the overlap-removal stage only when the test set was made after the fact and shares nothing public, which is rare.
Inputs
- Source document corpus — Raw documents the team owns or has licensed, such as articles, posts, transcripts, manuals, and internal wikis.
- Evaluation set — A held-out set used to score the fine-tuned model. You need it to remove training examples that overlap it, so the training set does not leak in.
- Coverage map — The dimensions the instruction set must span, such as task types, domain subareas, instruction styles, and difficulty bands.
- Quality filter (model or rubric) — An LLM judge or a classifier that scores each candidate instruction against the rubric.
Outputs
- Instruct dataset artefact — A versioned instruction-and-response set, typically 100 to 100k samples, ready for supervised fine-tuning.
- Curation log — A full audit of the extraction rules, the dedupe rates, the overlap hits, the filter cutoffs, and the sources of any added data.
- Coverage report — Counts per dimension that confirm the dataset spans the coverage map, or flag where it falls short.
Steps (7)
Extract candidate instruction-response pairs
Go through the source corpus and create candidate pairs. For each document, prompt an LLM with templates to write instructions whose answers come from that document. The corpus is the source of truth. The LLM rewrites; it does not invent.
Deduplicate
Remove exact duplicates first, then near-duplicates by word overlap or meaning. Duplicate instructions teach the model to lean on one style and lose its range.
Decontaminate against the evaluation set
Compare candidates to the held-out test set and drop anything that overlaps. Skip this step and your scores measure leakage, not real ability. Match both sides on word sequences or meaning, and log every hit.
Quality-filter
Score every remaining candidate with an LLM judge against a rubric, covering how clear the instruction is, how correct the response is, and how well it follows the format. Keep only candidates above the cutoff. By hand, check a sample of accepted and rejected items to confirm the filter is not dropping the wrong cases.
Explore for coverage
Slice the filtered set by the coverage map and find the thin dimensions. These gaps feed the next stage. They are not a final pass-or-fail check.
Augment under-represented dimensions
For each gap, make more samples. Generate them with a prompted LLM, paraphrase existing examples, or pull more from the corpus with targeted templates. Run the new samples back through dedupe, overlap removal, and the quality filter. They pass the same gates as everything else.
Package and version
Format the data the way the fine-tuning trainer expects, usually JSONL of message lists. Compute the statistics, such as sample count, token distribution, and counts per dimension. Tag the version. Register it in the dataset store with a trail back to the source corpus and the rejected-candidate logs.
Framework-specific instructions
Pick a framework and generate a framework-targeted rewrite of this methodology's steps.
Choose framework
AI-generated for Agent Development Kit (ADK) (Google) — verify against official docs.
Principles
- 100 to 100k carefully prepared samples beat a large dirty corpus. Quality, not quantity, is the lever.
- Remove test-set overlap before quality filtering. A leaked test example can score high on quality and still ruin the evaluation.
- Every added sample passes the same gates as the originals. There is no special path for generated data.
- Coverage is a stage, not an afterthought. The gaps you do not flag become the model's blind spots.
Known failure modes (3)
- ✕Automating a Broken Process
Generating instructions with the model that will later be fine-tuned — the model learns to produce what it already produced, no real capability is added.
- ✕Reward Hacking
The LLM judge filtering quality is gameable by the LLM generator authoring candidates; both ends collude on a score that doesn't reflect real quality.
- ✕Errors Swept Under the Rug
Skipping the rejected-sample audit — the filter throws away exactly the cases the production model will hit, and nobody notices until launch.
Related patterns (4)
- ★Dimensional Synthetic Eval Set
Generate evaluation inputs not by free-form LLM prompting (which mode-collapses) but by enumerating tuples over explicitly named dimensions and seeding generation from each tuple.
- ★Streaming Feature Pipeline
Process raw documents into RAG features as a continuous stream rather than a batch job, with typed models pinning each stage.
- ★★LLM-as-Judge
Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
- ★Frozen Rubric Reflection
Constrain reflection to a fixed, hand-authored rubric of criteria so the reviewer cannot invent new ones each run.
Related compositions (1)
Related methodologies (3)
- Dataset Curation Pipeline★★
Turn raw data into a versioned training dataset. Run it through an inspect, deduplicate, clean, filter, and format pipeline that openly trades off quality, coverage, and quantity.
- LLM Twin End-to-End Construction★
Produce a production-grade personalised LLM twin through a repeatable pipeline. The pipeline covers data collection, instruction-dataset generation, supervised fine-tuning, preference alignment, evaluation, deployment, and monitoring.
- Finetune-as-Last-Resort Escalation★★
Make teams use up prompt engineering, retrieval, and task splitting before they fine-tune, because fine-tuning is the most expensive and the hardest to undo.
Sources (3)
LLM Engineer's Handbook
Ch 3 'Data Engineering'; Ch 5 'Supervised Fine-tuning' “Task-specific models can be fine-tuned with a much smaller dataset, typically ranging from 100 to 100,000 samples”
Iusztin: Generate High-Quality Instruct Datasets for Fine-Tuning LLMs
“transform noisy documents collected from Notion and the Internet (through crawling) into a high-quality instruction dataset ... the quality of your dataset is the most critical aspect ... run multiple summaries for each document to augment…”
PacktPublishing/LLM-Engineers-Handbook (book companion repo)
“Instruct Dataset Pipeline: Generates instruction-following datasets via poetry poe run-generate-instruct-datasets-pipeline for supervised fine-tuning phases”
Provenance
- Added to catalog:
- Last updated:
- Verification status: verified