Methodology · Data Engineering · proven · verified

Dataset Curation Pipeline

also known as three-pillar dataset curation, QCQ dataset pipeline

Applies to: llm-app, agent, rag-system, eval-harness

Tags: dataset, curation, quality, coverage

Build any ML or LLM training dataset around three goals: quality, coverage, and quantity. Run the data through a clear pipeline that inspects, deduplicates, cleans, filters, checks coverage, then formats it. The pipeline outputs a versioned dataset, keeps the original raw data unchanged, and records every step so you can audit it. The guiding idea is that reading the data by hand delivers more value for less glory than almost anything else in machine learning. This pipeline forces that reading to happen.

Methodology process overview

flowchart TD
    raw[Raw data sources] --> preserve[Preserve raw unchanged]
    raw --> inspect[Manual inspection of sample]
    inspect --> dedupe[Deduplicate exact + near]
    dedupe --> clean[Clean: encoding, PII, structure]
    clean --> qfilt[Quality filter:\nheuristic + LLM-judge]
    qfilt -->|rejected| audit[Audit rejected sample]
    audit -->|filter wrong| qfilt
    qfilt --> cov[Check coverage map]
    cov -->|gap| aug[Augment / upsample]
    aug --> dedupe
    cov -->|covered| fmt[Format + tokenize]
    fmt --> ver[Version + register]
    ver --> art[Versioned dataset artefact]
    qfilt --> log[Curation log]
    dedupe --> log

Intent. Turn raw data into a versioned training dataset. Run it through a pipeline that inspects, deduplicates, cleans, filters, checks coverage, and formats, and that openly trades off quality, coverage, and quantity.

When to apply. Use this when preparing any real training, fine-tuning, or evaluation dataset. That covers a pre-training corpus, a fine-tuning instruction set, a preference set, or a RAG evaluation set. Run it before training, not after. Don't apply it for one-off prototypes whose dataset you will throw away, because the audit trail is wasted there. One exception: if the dataset is small enough for one engineer to read every example, you can skip the formal pipeline, but still write down that the inspection happened.

Example scenario

A team is preparing the training corpus for a domain-specialised assistant for healthcare procedural documentation. The raw sources are a mix of internal wiki exports, scraped procedure manuals, and chat transcripts: roughly 240,000 raw documents of widely varying quality. They follow the three-pillar curation pipeline.

First an engineer reads a random sample of 200 documents and writes a short note: there are encoding artefacts from the wiki export, near-duplicates from versioned procedure copies, PII in the transcripts, and entire format families (regulatory citations) that are over-represented while pediatric procedures are under-represented.

The pipeline then runs in order. Exact-match dedupe drops 18% of documents, and near-dupe dedupe drops another 11%. Cleaning fixes encoding and redacts PII. An LLM-judge quality filter scores each document against a rubric and keeps the top 62%, with a hand-audited sample of 100 rejected examples confirming the filter is not throwing away pediatric content disproportionately.

The coverage check against the predefined map flags pediatric procedures and one rare therapeutic category as under-represented. The team augments those by upsampling and prompted paraphrase, then runs the augmented samples back through dedupe and the quality filter so they pass the same gates.

The final dataset is formatted as JSONL of messages, tagged v0.3.1, and registered with a curation log that names every transformation, every threshold, and every reviewer, so that when a future audit asks why a particular family of cases is sparse, the answer is reachable rather than lost.
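The percentages in the scenario can be turned into rough document counts. The sketch below is only an illustration: the scenario does not say whether "another 11%" is measured against the raw corpus or against what survives exact dedupe, so the hypothetical `funnel` helper computes both readings, and it ignores the augmented examples added later.

```python
def funnel(raw, exact_drop_pct, near_drop_pct, keep_pct, near_of_remaining=True):
    """Rough document counts after each gate (integer arithmetic, rounded down).

    near_of_remaining is an assumption switch: True treats the near-dupe rate
    as a share of what survived exact dedupe, False as a share of the raw corpus.
    """
    after_exact = raw * (100 - exact_drop_pct) // 100
    if near_of_remaining:
        after_near = after_exact * (100 - near_drop_pct) // 100
    else:
        after_near = raw * (100 - exact_drop_pct - near_drop_pct) // 100
    after_filter = after_near * keep_pct // 100
    return after_exact, after_near, after_filter


print(funnel(240_000, 18, 11, 62))                           # (196800, 175152, 108594)
print(funnel(240_000, 18, 11, 62, near_of_remaining=False))  # (196800, 170400, 105648)
```

Under either reading, somewhat over 100,000 documents reach the coverage check, which is the kind of figure the curation log should record explicitly instead of leaving it to be inferred.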

Inputs

Raw data sources — The source documents, transcripts, prompts, and examples. This is the raw material before any curation.
Target task definition — What the dataset will train or evaluate. This is what sets the meaning of 'quality' and 'coverage'.
Coverage map — A list of the dimensions the dataset must span, such as formats, edge cases, demographics, languages, and error types.

Outputs

Versioned dataset artefact — The curated dataset, checked into a registry with a version tag, a schema, statistics, and a trail back to the raw inputs.
Curation log — An audit trail of every step. It records what was removed, what was kept, what was generated, and the rule behind each choice.
Preserved raw — The raw inputs, kept unchanged. This lets you re-run the pipeline and revisit the curation rules later.
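One possible shape for the curation log is an append-only JSONL file with one record per pipeline step. This is a minimal sketch, not a prescribed schema: the field names (`step`, `rule`, `n_in`, `n_out`, `reviewer`) and the helper `log_step` are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone


def log_step(log_path, step, rule, n_in, n_out, reviewer=None):
    """Append one audit record to a JSONL curation log and return it.

    All field names here are hypothetical; adapt them to your own registry.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": step,              # e.g. "exact_dedupe"
        "rule": rule,              # the rule or threshold behind the choice
        "n_in": n_in,              # examples entering the step
        "n_out": n_out,            # examples leaving the step
        "n_removed": n_in - n_out,
        "reviewer": reviewer,      # who signed off, if anyone
    }
    # Append mode keeps earlier records untouched, so the log stays an audit trail.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Because every record carries the rule alongside the counts, a later reader can see both what was removed and why.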

Steps (6)

Inspect raw data manually
Before any automation, an engineer reads a real sample of the raw data. Most surprises about the corpus show up here, such as encoding issues, duplicates, bias, and irrelevant material. This is the step with the most value for the least glory.
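Drawing the inspection sample reproducibly makes it possible to record exactly which documents were read. The sketch below assumes documents have string IDs; the function name, the default sample size, and the seed are illustrative choices, not part of the methodology.

```python
import random


def draw_inspection_sample(doc_ids, sample_size=200, seed=0):
    """Return a reproducible random sample of document IDs for manual reading."""
    rng = random.Random(seed)   # fixed seed so the same sample can be re-drawn later
    ids = sorted(doc_ids)       # sort first so input order does not change the sample
    k = min(sample_size, len(ids))
    return rng.sample(ids, k)
```

Recording the seed and sample size in the curation log is enough to show later which documents the inspection covered.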
Deduplicate
Remove exact duplicates first, then near-duplicates by word overlap or meaning. Duplicates inflate the apparent size and push the model toward whatever is repeated. Log the dedupe rates. High rates usually mean the sources were chosen poorly.
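The two dedupe passes can be sketched as follows. This is one simple way to do it, under stated assumptions: "exact" here means identical after lowercasing and whitespace normalization, near-duplicates are detected by Jaccard similarity over 3-word shingles, and the 0.8 threshold is an arbitrary illustrative value. The pairwise comparison is quadratic, so a large corpus would need an approximate method such as MinHash with locality-sensitive hashing.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def exact_dedupe(docs):
    """Keep the first occurrence of each document, compared by SHA-256 of normalized text."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, n=3):
    """Set of n-word shingles; short texts become a single shingle."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_dedupe(docs, threshold=0.8, n=3):
    """Drop a document if it is at least `threshold` similar to one already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def dedupe_with_rates(docs, threshold=0.8):
    """Run both passes and report each drop rate as a share of the input size."""
    after_exact = exact_dedupe(docs)
    after_near = near_dedupe(after_exact, threshold)
    total = len(docs) or 1
    rates = {
        "exact_rate": (len(docs) - len(after_exact)) / total,
        "near_rate": (len(after_exact) - len(after_near)) / total,
    }
    return after_near, rates
```

Returning the rates alongside the kept documents makes it easy to write them into the curation log, where an unusually high rate can prompt a second look at source selection.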
Clean
Strip control characters, fix encoding, repair broken structure, and redact personal data. Cleaning is mechanical and rule-based. It does not yet touch quality or coverage.
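A minimal sketch of rule-based cleaning is shown below. It covers only part of what the step describes: Unicode normalization, control-character removal, whitespace repair, and redaction of email addresses as a stand-in for personal data. The regular expression and the `[EMAIL]` placeholder are assumptions for illustration; real PII redaction and repair of mis-decoded text need considerably more than this.

```python
import re
import unicodedata

# Illustrative pattern only: it catches common email shapes, not all PII.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def clean_text(text):
    """Apply mechanical cleaning rules; does not judge quality or coverage."""
    # Compose characters so visually identical strings compare equal.
    text = unicodedata.normalize("NFC", text)
    # Drop control characters (Unicode category Cc) except newline and tab.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Redact email addresses with a fixed placeholder.
    text = EMAIL_RE.sub("[EMAIL]", text)
    # Collapse runs of spaces and tabs, and trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    return "\n".join(lines).strip()
```

Keeping each rule as a separate, commented line makes it straightforward to list the rules in the curation log.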
Filter on quality
Apply quality filters. Some are simple rules, such as length, vocabulary, or the share of meaningful words. Others use a model, such as a classifier or an LLM that scores each example. Keep only examples above the cutoff. Write down the cutoff, and check a hand-labelled sample to see how often the filter is wrong in each direction.
uses: LLM-as-Judge Frozen Rubric Reflection
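The cutoff and its error check can be sketched with a toy heuristic score. Everything specific here is an assumption: `quality_score` is a hypothetical stand-in for whatever rule, classifier, or LLM judge is actually used, and the 20-token length target is arbitrary. The part that carries over is measuring, on a hand-labelled sample, how often the filter rejects good examples and accepts bad ones.

```python
def quality_score(text):
    """Hypothetical heuristic score in [0, 1]: share of alphabetic tokens times a length factor."""
    tokens = text.split()
    if not tokens:
        return 0.0
    alpha = sum(t.strip(".,;:!?()\"'").isalpha() for t in tokens)
    alpha_share = alpha / len(tokens)
    length_factor = min(len(tokens) / 20, 1.0)  # 20 tokens is an arbitrary target
    return alpha_share * length_factor


def filter_by_cutoff(docs, cutoff):
    """Split documents into kept and rejected at the given score cutoff."""
    kept = [d for d in docs if quality_score(d) >= cutoff]
    rejected = [d for d in docs if quality_score(d) < cutoff]
    return kept, rejected


def filter_error_rates(labelled, cutoff):
    """labelled: list of (text, is_good) pairs judged by hand.

    Returns how often the filter is wrong in each direction.
    """
    good = [t for t, is_good in labelled if is_good]
    bad = [t for t, is_good in labelled if not is_good]
    false_rejects = sum(quality_score(t) < cutoff for t in good)
    false_accepts = sum(quality_score(t) >= cutoff for t in bad)
    return {
        "false_reject_rate": false_rejects / len(good) if good else 0.0,
        "false_accept_rate": false_accepts / len(bad) if bad else 0.0,
    }
```

In this toy version a short but perfectly good example scores low, which is exactly the kind of false reject the hand-labelled check is meant to surface before the cutoff is fixed.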
Cover the dimensions
Check the filtered set against the coverage map. For any thin dimension, either add data by generating or upsampling it, or note plainly that the model will be weak there. Coverage gaps you do not flag turn into silent failures in production.
uses: Dimensional Synthetic Eval Set
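A coverage check can be as simple as counting examples per value of a dimension and comparing against a minimum. The sketch below assumes each example is a dict carrying its dimension labels and that the coverage map gives a minimum count per value; both representations, and the name `coverage_gaps`, are illustrative assumptions.

```python
from collections import Counter


def coverage_gaps(examples, coverage_map, dimension):
    """Return the values of `dimension` that fall below their required minimum.

    examples:     list of dicts, each with a label under `dimension`
    coverage_map: {value: minimum number of examples required}
    """
    counts = Counter(ex.get(dimension) for ex in examples)
    return {
        value: {"have": counts.get(value, 0), "need": minimum}
        for value, minimum in coverage_map.items()
        if counts.get(value, 0) < minimum
    }
```

Values that appear in the coverage map but never in the data show up with `have` equal to 0, so a missing dimension is reported instead of silently skipped.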
Format and version
Convert to the training format you need, such as JSONL of messages, parquet, or tokenised shards. Compute the dataset statistics. Tag the version. Register it in the dataset store with a full trail. Keep the raw and intermediate stages so you can re-run the pipeline.
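The format-and-version step might look like the sketch below, which writes JSONL and a small manifest next to it. The file naming scheme, the manifest fields, and the use of a SHA-256 content hash as the trail back to the exact bytes are all assumptions; a real dataset registry will have its own conventions.

```python
import hashlib
import json
from pathlib import Path


def write_versioned_dataset(examples, out_dir, version):
    """Write examples as JSONL plus a manifest with basic statistics and a content hash."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # One JSON object per line; sorted keys keep the bytes stable across runs.
    lines = [json.dumps(ex, ensure_ascii=False, sort_keys=True) for ex in examples]
    payload = ("\n".join(lines) + "\n").encode("utf-8") if lines else b""
    data_path = out / f"dataset-{version}.jsonl"   # hypothetical naming scheme
    data_path.write_bytes(payload)

    manifest = {
        "version": version,
        "num_examples": len(examples),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "data_file": data_path.name,
    }
    manifest_path = out / f"manifest-{version}.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

Storing the hash in the manifest lets anyone confirm later that the file they are training on is the one that was registered under that version tag.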

Principles

Read the data by hand. Every dataset starts with eyes on raw examples.
Three goals, not two. Quality and quantity are easy to remember; coverage is the one people forget.
Preserve the raw data. You can only reverse curation if the original survives.
Keep the curation log. A dataset whose curation rules are lost cannot be fixed.
