Methodology · Data Engineeringemergingverified

Instruct Dataset Generation Pipeline

also known as seven-stage instruction dataset, noisy-docs to instruct

Applies to: llm-appagentrag-system

Tags: instruct-datasetsftfine-tuningaugmentation

A seven-stage pipeline that turns a messy pile of documents into a clean dataset for teaching a model to follow instructions, ready for supervised fine-tuning. The stages are: pull out candidate examples, remove duplicates, remove anything that overlaps your test set, filter for quality, check coverage, add data where it is thin, then package it. The pipeline rests on a clear finding. For a task-specific fine-tune, 100 to 100,000 carefully prepared examples beat a much larger but dirty pile. Quality, not quantity, is the lever.

Methodology process overview

Intent. Turn a raw document corpus into a clean, leak-free, well-covered instruction-tuning dataset through seven clear stages.

When to apply. Use this when preparing the instruction-tuning dataset for a supervised fine-tune of an LLM, such as a domain model, an LLM twin, or a task-specific assistant. It helps most when the starting material is messy, such as scraped pages, exported chats, or mixed-quality archives. Do not use it when you already have a hand-curated instruction set of proven quality. Two exceptions: skip the add-data stage when the corpus is large and evenly spread, and skip the overlap-removal stage only when the test set was made after the fact and shares nothing public, which is rare.

Inputs

  • Source document corpusRaw documents the team owns or has licensed, such as articles, posts, transcripts, manuals, and internal wikis.
  • Evaluation setA held-out set used to score the fine-tuned model. You need it to remove training examples that overlap it, so the training set does not leak in.
  • Coverage mapThe dimensions the instruction set must span, such as task types, domain subareas, instruction styles, and difficulty bands.
  • Quality filter (model or rubric)An LLM judge or a classifier that scores each candidate instruction against the rubric.

Outputs

  • Instruct dataset artefactA versioned instruction-and-response set, typically 100 to 100k samples, ready for supervised fine-tuning.
  • Curation logA full audit of the extraction rules, the dedupe rates, the overlap hits, the filter cutoffs, and the sources of any added data.
  • Coverage reportCounts per dimension that confirm the dataset spans the coverage map, or flag where it falls short.

Steps (7)

  1. Extract candidate instruction-response pairs

    Go through the source corpus and create candidate pairs. For each document, prompt an LLM with templates to write instructions whose answers come from that document. The corpus is the source of truth. The LLM rewrites; it does not invent.

  2. Deduplicate

    Remove exact duplicates first, then near-duplicates by word overlap or meaning. Duplicate instructions teach the model to lean on one style and lose its range.

  3. Decontaminate against the evaluation set

    Compare candidates to the held-out test set and drop anything that overlaps. Skip this step and your scores measure leakage, not real ability. Match both sides on word sequences or meaning, and log every hit.

  4. Quality-filter

    Score every remaining candidate with an LLM judge against a rubric, covering how clear the instruction is, how correct the response is, and how well it follows the format. Keep only candidates above the cutoff. By hand, check a sample of accepted and rejected items to confirm the filter is not dropping the wrong cases.

    usesLLM-as-JudgeFrozen Rubric Reflection

  5. Explore for coverage

    Slice the filtered set by the coverage map and find the thin dimensions. These gaps feed the next stage. They are not a final pass-or-fail check.

    usesDimensional Synthetic Eval Set

  6. Augment under-represented dimensions

    For each gap, make more samples. Generate them with a prompted LLM, paraphrase existing examples, or pull more from the corpus with targeted templates. Run the new samples back through dedupe, overlap removal, and the quality filter. They pass the same gates as everything else.

  7. Package and version

    Format the data the way the fine-tuning trainer expects, usually JSONL of message lists. Compute the statistics, such as sample count, token distribution, and counts per dimension. Tag the version. Register it in the dataset store with a trail back to the source corpus and the rejected-candidate logs.

    usesStreaming Feature Pipeline

Framework-specific instructions

Pick a framework and generate a framework-targeted rewrite of this methodology's steps.

Choose framework

AI-generated for Agent Development Kit (ADK) (Google) — verify against official docs.

Principles

  • 100 to 100k carefully prepared samples beat a large dirty corpus. Quality, not quantity, is the lever.
  • Remove test-set overlap before quality filtering. A leaked test example can score high on quality and still ruin the evaluation.
  • Every added sample passes the same gates as the originals. There is no special path for generated data.
  • Coverage is a stage, not an afterthought. The gaps you do not flag become the model's blind spots.

Known failure modes (3)

Related patterns (4)

Related compositions (1)

Related methodologies (3)

Sources (3)

Provenance

  • Added to catalog:
  • Last updated:
  • Verification status: verified