Methodology · Data Engineering · proven · verified

Dataset Curation Pipeline

also known as three-pillar dataset curation, QCQ dataset pipeline

Applies to: llm-app, agent, rag-system, eval-harness

Tags: dataset, curation, quality, coverage

Build any ML or LLM training dataset around three goals: quality, coverage, and quantity. Run the data through an explicit pipeline that inspects, deduplicates, cleans, filters, checks coverage, then formats it. The pipeline outputs a versioned dataset, keeps the original raw data unchanged, and records every step so you can audit it. The guiding principle is that reading the data by hand gives more value for less glory than almost anything else in machine learning. This pipeline forces that reading to happen.

Methodology process overview

Intent. Turn raw data into a versioned training dataset. Run it through an inspect, deduplicate, clean, filter, cover, and format pipeline that makes the trade-offs between quality, coverage, and quantity explicit.

When to apply. Use this when preparing any real training, fine-tuning, or evaluation dataset. That covers a pre-training corpus, a fine-tuning instruction set, a preference set, or a RAG evaluation set. Run it before training, not after. Don't apply it to one-off prototypes whose dataset you will throw away, because the audit trail is wasted there. One exception: if the dataset is small enough for one engineer to read every example, you can skip the formal pipeline, but still write down that the inspection happened.

Inputs

  • Raw data sources: The source documents, transcripts, prompts, and examples. This is the raw material before any curation.
  • Target task definition: What the dataset will train or evaluate. This is what sets the meaning of 'quality' and 'coverage'.
  • Coverage map: A list of the dimensions the dataset must span, such as formats, edge cases, demographics, languages, and error types.

Outputs

  • Versioned dataset artefact: The curated dataset, checked into a registry with a version tag, a schema, statistics, and a trail back to the raw inputs.
  • Curation log: An audit trail of every step. It records what was removed, what was kept, what was generated, and the rule behind each choice.
  • Preserved raw: The raw inputs, kept unchanged. This lets you re-run the pipeline and revisit the curation rules later.
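
One way to picture the curation log is as an append-only list of JSON records, one per pipeline step. The sketch below is illustrative only: the `LogEntry` fields and the `append_entry` helper are assumptions made for this example, not a schema the methodology prescribes.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class LogEntry:
    """One curation decision. Field names are hypothetical."""
    step: str                      # e.g. "deduplicate", "filter"
    rule: str                      # the rule behind the choice
    n_in: int                      # examples entering the step
    n_out: int                     # examples leaving the step
    removed_ids: List[str] = field(default_factory=list)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def append_entry(log_lines: List[str], entry: LogEntry) -> None:
    """Append the entry as one JSON line; earlier lines are never rewritten."""
    log_lines.append(json.dumps(asdict(entry), sort_keys=True))


log: List[str] = []
append_entry(log, LogEntry("deduplicate", "exact match on normalised text", 100, 90))
```

Keeping the log append-only means a later reader can replay the decisions in order against the preserved raw data.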

Steps (6)

  1. Inspect raw data manually

    Before any automation, an engineer reads a real sample of the raw data. Most surprises about the corpus show up here, such as encoding issues, duplicates, bias, and irrelevant material. This is the step with the most value for the least glory.
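
To make the manual read repeatable, it can help to draw the sample with a fixed seed so a second reviewer sees the same examples. The following is a minimal sketch using reservoir sampling; the function name and the default seed are assumptions for illustration.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Draw up to k records uniformly at random in a single pass.

    Works on streams too large to hold in memory; the fixed seed makes
    the sample reproducible.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample
```

Recording the seed and sample size in the curation log is one simple way to document that the inspection happened.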

  2. Deduplicate

    Remove exact duplicates first, then near-duplicates by word overlap or meaning. Duplicates inflate the apparent size and push the model toward whatever is repeated. Log the dedupe rates. High rates usually mean the sources were chosen poorly.
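
As a rough illustration of this step, the sketch below removes exact duplicates by hashing normalised text, then near-duplicates by Jaccard overlap of word trigrams, and reports both rates. The threshold of 0.8, the trigram size, and the choice to treat case and whitespace variants as exact duplicates are assumptions. The pairwise comparison is quadratic, so a large corpus would need an approximate method such as MinHash with locality-sensitive hashing instead.

```python
import hashlib
import re
from typing import Dict, List, Set, Tuple


def _normalise(text: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return re.sub(r"\s+", " ", text.strip().lower())


def _shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Set of word n-grams; short texts yield a single shingle."""
    words = _normalise(text).split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set, b: Set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(
    texts: List[str], threshold: float = 0.8
) -> Tuple[List[str], Dict[str, float]]:
    """Return kept texts plus exact and near-duplicate rates for the log."""
    seen_hashes: Set[str] = set()
    kept: List[str] = []
    kept_shingles: List[Set[Tuple[str, ...]]] = []
    exact = near = 0
    for text in texts:
        digest = hashlib.sha256(_normalise(text).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            exact += 1
            continue
        seen_hashes.add(digest)
        sh = _shingles(text)
        # O(n^2) comparison against everything kept so far.
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            near += 1
            continue
        kept.append(text)
        kept_shingles.append(sh)
    total = len(texts) or 1
    return kept, {"exact_rate": exact / total, "near_rate": near / total}
```

The first occurrence of each cluster is the one kept, so the input order affects which variant survives.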

  3. Clean

    Strip control characters, fix encoding, repair broken structure, and redact personal data. Cleaning is mechanical and rule-based. It does not yet touch quality or coverage.
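
A minimal sketch of a rule-based cleaner follows, assuming plain-text examples. It normalises Unicode, drops control characters, and masks email addresses. The email regex is a deliberately simple stand-in: real personal-data redaction needs broader detection, and repairing mis-decoded text (mojibake) is not attempted here.

```python
import re
import unicodedata

# Illustrative pattern only; it will miss many kinds of personal data.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def clean(text: str) -> str:
    """Apply mechanical, rule-based cleaning to one example."""
    # Compose characters so visually identical strings compare equal.
    text = unicodedata.normalize("NFC", text)
    # Drop control characters (category Cc) except newline and tab.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Mask email addresses with a placeholder token.
    text = _EMAIL.sub("[EMAIL]", text)
    # Collapse runs of spaces and tabs, then trim each line.
    text = re.sub(r"[ \t]+", " ", text)
    return "\n".join(line.strip() for line in text.split("\n")).strip()
```

Because every rule is deterministic, the same cleaner can be re-run on the preserved raw data whenever the rules change.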

  4. Filter on quality

    Apply quality filters. Some are simple rules, such as length, vocabulary, or the share of meaningful words. Others use a model, such as a classifier or an LLM that scores each example. Keep only examples above the cutoff. Write down the cutoff, and check a hand-labelled sample to see how often the filter is wrong in each direction.

    uses: LLM-as-Judge, Frozen Rubric Reflection
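
The sketch below shows one way to pair a simple rule-based score with a check against a hand-labelled sample. The scoring heuristic, the length bounds, and the function names are assumptions for illustration; a classifier or LLM scorer could be passed in as `score` instead.

```python
from typing import Callable, Iterable, Tuple


def heuristic_score(text: str, min_words: int = 5, max_words: int = 2000) -> float:
    """Share of alphabetic words, zeroed when the length is out of bounds."""
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return 0.0
    alpha = sum(1 for w in words if w.strip(".,;:!?\"'()").isalpha())
    return alpha / len(words)


def filter_error_rates(
    labelled: Iterable[Tuple[str, bool]],
    score: Callable[[str], float],
    cutoff: float,
) -> Tuple[float, float]:
    """Return (false_reject_rate, false_accept_rate) on a labelled sample.

    Each item is (text, is_good), where is_good is the human label.
    """
    good = bad = false_reject = false_accept = 0
    for text, is_good in labelled:
        keep = score(text) >= cutoff
        if is_good:
            good += 1
            if not keep:
                false_reject += 1
        else:
            bad += 1
            if keep:
                false_accept += 1
    return (
        false_reject / good if good else 0.0,
        false_accept / bad if bad else 0.0,
    )
```

Reporting both rates next to the cutoff in the curation log shows how much good data the filter discards as well as how much bad data it lets through.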

  5. Cover the dimensions

    Check the filtered set against the coverage map. For any thin dimension, either add data by generating or upsampling it, or note plainly that the model will be weak there. Coverage gaps you do not flag turn into silent failures in production.

    uses: Dimensional Synthetic Eval Set
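
A coverage check can be as simple as counting examples per value of each dimension and listing the values that fall short. This sketch assumes each example carries its dimension values as metadata keys and that the coverage map lists the expected values per dimension; the `min_count` default is an arbitrary placeholder.

```python
from collections import Counter
from typing import Dict, Iterable, List, Mapping


def coverage_gaps(
    examples: Iterable[Mapping[str, str]],
    coverage_map: Mapping[str, List[str]],
    min_count: int = 50,
) -> Dict[str, Dict[str, int]]:
    """Return {dimension: {value: count}} for every value below min_count."""
    counts: Dict[str, Counter] = {dim: Counter() for dim in coverage_map}
    for ex in examples:
        for dim in coverage_map:
            value = ex.get(dim)
            if value is not None:
                counts[dim][value] += 1
    gaps: Dict[str, Dict[str, int]] = {}
    for dim, expected in coverage_map.items():
        # Values never seen report a count of 0.
        thin = {v: counts[dim][v] for v in expected if counts[dim][v] < min_count}
        if thin:
            gaps[dim] = thin
    return gaps
```

The returned gaps are the list to act on: each one is either filled with generated or upsampled data, or written into the dataset notes as a known weakness.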

  6. Format and version

    Convert to the training format you need, such as JSONL of messages, parquet, or tokenised shards. Compute the dataset statistics. Tag the version. Register it in the dataset store with a full trail. Keep the raw and intermediate stages so you can re-run the pipeline.
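
The final step can be sketched as serialising to JSONL, hashing the result, and recording a small manifest that points back to the raw inputs. The manifest fields, the version-tag format, and the `raw_sha256` argument are assumptions for illustration; a real dataset registry will have its own schema.

```python
import hashlib
import json
from typing import Dict, List


def to_jsonl(examples: List[Dict]) -> str:
    """Serialise examples as one JSON object per line, with stable key order."""
    return "".join(
        json.dumps(ex, sort_keys=True, ensure_ascii=False) + "\n" for ex in examples
    )


def build_manifest(
    examples: List[Dict], raw_sha256: str, pipeline_version: str
) -> Dict:
    """Compute basic statistics and a content-derived version tag."""
    payload = to_jsonl(examples).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    return {
        # The tag changes whenever the content changes.
        "version": f"{pipeline_version}-{digest[:12]}",
        "sha256": digest,
        "n_examples": len(examples),
        "n_bytes": len(payload),
        # Trail back to the preserved raw inputs.
        "raw_sha256": raw_sha256,
    }
```

Deriving the tag from the content hash means two runs that produce identical data get the same version, and any change to the curation rules that alters the output yields a new one.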

Principles

  • Read the data by hand. Every dataset starts with eyes on raw examples.
  • Three goals, not two. Quality and quantity are easy to remember; coverage is the one people forget.
  • Preserve the raw data. You can only reverse curation if the original survives.
  • Keep the curation log. A dataset whose curation rules are lost cannot be fixed.

Known failure modes (3)

Related patterns (4)

Related compositions (2)

Related methodologies (2)

Sources (2)

Provenance

  • Added to catalog:
  • Last updated:
  • Verification status: verified