Methodology · Data Engineeringemergingverified

Instruct Dataset Generation Pipeline

also known as seven-stage instruction dataset, noisy-docs to instruct

Applies to: llm-appagentrag-system

Tags: instruct-datasetsftfine-tuningaugmentation

A seven-stage pipeline that turns a messy pile of documents into a clean dataset for teaching a model to follow instructions, ready for supervised fine-tuning. The stages are: pull out candidate examples, remove duplicates, remove anything that overlaps your test set, filter for quality, check coverage, add data where it is thin, then package it. The pipeline rests on a clear finding. For a task-specific fine-tune, 100 to 100,000 carefully prepared examples beat a much larger but dirty pile. Quality, not quantity, is the lever.

Methodology process overview

flowchart TD src[Source document corpus] --> s1[1. Extract candidate\n(instruction, response) pairs] s1 --> s2[2. Deduplicate exact + near] s2 --> s3[3. Decontaminate against eval set] evalset[Held-out evaluation set] --> s3 s3 --> s4[4. Quality-filter via LLM judge] rubric[Quality rubric] --> s4 s4 --> s5[5. Explore for coverage] covmap[Coverage map] --> s5 s5 -->|gaps found| s6[6. Augment under-represented] s6 --> s2 s5 -->|covered| s7[7. Package + version] s7 --> art[Instruct dataset artefact\n100–100k samples] s4 --> audit[Audit accepted + rejected sample] audit --> s4

Intent. Turn a raw document corpus into a clean, leak-free, well-covered instruction-tuning dataset through seven clear stages.

When to apply. Use this when preparing the instruction-tuning dataset for a supervised fine-tune of an LLM, such as a domain model, an LLM twin, or a task-specific assistant. It helps most when the starting material is messy, such as scraped pages, exported chats, or mixed-quality archives. Do not use it when you already have a hand-curated instruction set of proven quality. Two exceptions: skip the add-data stage when the corpus is large and evenly spread, and skip the overlap-removal stage only when the test set was made after the fact and shares nothing public, which is rare.

Example scenario

A solo builder is creating an LLM-twin assistant trained to imitate the builder's own technical writing style. The raw material is a noisy corpus of 3,400 documents — blog posts, newsletter issues, internal notes, scraped web pages quoting the builder, exported chat conversations. They follow the seven-stage instruct-dataset pipeline. Stage one prompts an LLM with a template per document to author candidate (instruction, response) pairs whose responses come from the document content; this yields about 18,000 candidates. Stage two dedupes exact and near-duplicates, dropping 4,200. Stage three decontaminates against a held-out evaluation set of 200 reference (instruction, response) pairs hand-written for the project — any candidate sharing n-gram overlap with an eval item is dropped, with every hit logged. Stage four runs an LLM judge against a rubric (clarity of instruction, faithfulness of response to source, style adherence) and keeps the top 6,800 candidates above the threshold; the builder hand-audits 50 accepted and 50 rejected items and confirms the filter is not throwing away the casual-tone examples that are most stylistically valuable. Stage five slices the surviving set by the coverage map (topics, instruction styles, difficulty) and flags two underrepresented topic clusters. Stage six augments those clusters by prompted paraphrase and runs the new samples back through dedupe, decontamination, and the quality filter — they get no special path. Stage seven formats as JSONL of message lists, computes statistics (final count ~7,400 samples), tags the version, and registers the dataset alongside its full curation log. The builder honours the rule: quality, not quantity. Earlier experiments with the raw 18,000 candidates produced an SFT model that mimicked surface tics without capturing voice; the 7,400-sample curated set produced a recognisably stylistic twin.

Inputs

Source document corpus — Raw documents the team owns or has licensed, such as articles, posts, transcripts, manuals, and internal wikis.
Evaluation set — A held-out set used to score the fine-tuned model. You need it to remove training examples that overlap it, so the training set does not leak in.
Coverage map — The dimensions the instruction set must span, such as task types, domain subareas, instruction styles, and difficulty bands.
Quality filter (model or rubric) — An LLM judge or a classifier that scores each candidate instruction against the rubric.

Outputs

Instruct dataset artefact — A versioned instruction-and-response set, typically 100 to 100k samples, ready for supervised fine-tuning.
Curation log — A full audit of the extraction rules, the dedupe rates, the overlap hits, the filter cutoffs, and the sources of any added data.
Coverage report — Counts per dimension that confirm the dataset spans the coverage map, or flag where it falls short.

Steps (7)

Extract candidate instruction-response pairs
Go through the source corpus and create candidate pairs. For each document, prompt an LLM with templates to write instructions whose answers come from that document. The corpus is the source of truth. The LLM rewrites; it does not invent.
Deduplicate
Remove exact duplicates first, then near-duplicates by word overlap or meaning. Duplicate instructions teach the model to lean on one style and lose its range.
Decontaminate against the evaluation set
Compare candidates to the held-out test set and drop anything that overlaps. Skip this step and your scores measure leakage, not real ability. Match both sides on word sequences or meaning, and log every hit.
Quality-filter
Score every remaining candidate with an LLM judge against a rubric, covering how clear the instruction is, how correct the response is, and how well it follows the format. Keep only candidates above the cutoff. By hand, check a sample of accepted and rejected items to confirm the filter is not dropping the wrong cases.
usesLLM-as-Judge Frozen Rubric Reflection
Explore for coverage
Slice the filtered set by the coverage map and find the thin dimensions. These gaps feed the next stage. They are not a final pass-or-fail check.
usesDimensional Synthetic Eval Set
Augment under-represented dimensions
For each gap, make more samples. Generate them with a prompted LLM, paraphrase existing examples, or pull more from the corpus with targeted templates. Run the new samples back through dedupe, overlap removal, and the quality filter. They pass the same gates as everything else.
Package and version
Format the data the way the fine-tuning trainer expects, usually JSONL of message lists. Compute the statistics, such as sample count, token distribution, and counts per dimension. Tag the version. Register it in the dataset store with a trail back to the source corpus and the rejected-candidate logs.
usesStreaming Feature Pipeline

Framework-specific instructions

Pick a framework and generate a framework-targeted rewrite of this methodology's steps.

Choose framework

AI-generated for Agent Development Kit (ADK) (Google) — verify against official docs.

Principles

100 to 100k carefully prepared samples beat a large dirty corpus. Quality, not quantity, is the lever.
Remove test-set overlap before quality filtering. A leaked test example can score high on quality and still ruin the evaluation.
Every added sample passes the same gates as the originals. There is no special path for generated data.
Coverage is a stage, not an afterthought. The gaps you do not flag become the model's blind spots.

Known failure modes (3)

Related patterns (4)

Related compositions (1)

recipe · abstract shape
Production LLM Platform
Stand up a production LLM/RAG system whose data pipeline, model pipeline, and inference path scale and deploy independently.

Related methodologies (3)

Sources (3)

Provenance

Added to catalog: 2026-05-24
Last updated: 2026-05-27
Verification status: verified

Methodology process overview

Steps (7)

Extract candidate instruction-response pairs

Deduplicate

Decontaminate against the evaluation set

Quality-filter

Explore for coverage

Augment under-represented dimensions

Package and version