Dataset Curation Pipeline
also known as three-pillar dataset curation, QCQ dataset pipeline
Build any ML or LLM training dataset around three goals: quality, coverage, and quantity. Run the data through a clear pipeline that inspects, deduplicates, cleans, filters, then formats it. The pipeline outputs a versioned dataset, keeps the original raw data unchanged, and records every step so you can audit it. The guiding line is that reading the data by hand gives more value for less glory than almost anything else in machine learning. This pipeline forces that reading to happen.
Methodology process overview
Intent. Turn raw data into a versioned training dataset. Run it through an inspect, deduplicate, clean, filter, and format pipeline that openly trades off quality, coverage, and quantity.
When to apply. Use this when preparing any training, fine-tuning, or evaluation dataset that you intend to keep. That covers a pre-training corpus, a fine-tuning instruction set, a preference set, or a RAG evaluation set. Run it before training, not after. Don't apply it to one-off prototypes whose dataset you will throw away, because the audit trail is wasted there. One exception: if the dataset is small enough for one engineer to read every example, you can skip the formal pipeline, but still write down that the inspection happened.
Inputs
- Raw data sources — The source documents, transcripts, prompts, and examples. This is the raw material before any curation.
- Target task definition — What the dataset will train or evaluate. This is what sets the meaning of 'quality' and 'coverage'.
- Coverage map — A list of the dimensions the dataset must span, such as formats, edge cases, demographics, languages, and error types.
Outputs
- Versioned dataset artefact — The curated dataset, checked into a registry with a version tag, a schema, statistics, and a trail back to the raw inputs.
- Curation log — An audit trail of every step. It records what was removed, what was kept, what was generated, and the rule behind each choice.
- Preserved raw — The raw inputs, kept unchanged. This lets you re-run the pipeline and revisit the curation rules later.
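As an illustration of the curation log output, the sketch below keeps an append-only list of decisions and serialises it as JSONL. The `LogEntry` and `CurationLog` names and their fields are hypothetical, chosen only to mirror the items the log is said to record (step, what happened, and the rule behind it).

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Dict, List, Tuple


@dataclass
class LogEntry:
    step: str        # pipeline step, e.g. "deduplicate"
    action: str      # "removed", "kept", or "generated"
    example_id: str  # identifier of the affected example
    rule: str        # the rule behind the choice


@dataclass
class CurationLog:
    entries: List[LogEntry] = field(default_factory=list)

    def record(self, step: str, action: str, example_id: str, rule: str) -> None:
        """Append one decision; entries are never edited or deleted."""
        self.entries.append(LogEntry(step, action, example_id, rule))

    def counts(self) -> Dict[Tuple[str, str], int]:
        """Count entries per (step, action) pair for a quick audit summary."""
        out: Dict[Tuple[str, str], int] = {}
        for e in self.entries:
            key = (e.step, e.action)
            out[key] = out.get(key, 0) + 1
        return out

    def to_jsonl(self) -> str:
        """Serialise the log as one JSON object per line."""
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self.entries)


log = CurationLog()
log.record("deduplicate", "removed", "doc-2", "exact duplicate of doc-1")
log.record("filter", "kept", "doc-1", "length >= 5 words")
print(log.counts())
```

Storing the rule as free text keeps the sketch short. A real log would more likely reference a rule identifier plus the pipeline version.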
Steps (6)
Inspect raw data manually
Before any automation, an engineer reads a real sample of the raw data. Most surprises about the corpus show up here, such as encoding issues, duplicates, bias, and irrelevant material. This is the step with the most value for the least glory.
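One way to pull a sample for manual reading is reservoir sampling, which draws a uniform random sample from a stream without loading the whole corpus. This is a minimal sketch; the function name, the sample size of 20, and the fixed seed are assumptions made for illustration, not part of the methodology.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-read later
    sample: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                sample[j] = item       # replace with probability k / (i + 1)
    return sample


# Hypothetical raw stream standing in for real source records.
raw = (f"record {n}" for n in range(10_000))
to_read = reservoir_sample(raw, k=20, seed=42)
```

A fixed seed makes the inspected sample reproducible, so the note that the inspection happened can name exactly which records were read.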
Deduplicate
Remove exact duplicates first, then near-duplicates, detected by word overlap or by semantic similarity. Duplicates inflate the apparent size of the dataset and push the model toward whatever is repeated. Log the deduplication rates. High rates usually mean the sources were chosen poorly.
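The two passes can be sketched as follows: an exact pass using content hashes, then a near-duplicate pass using Jaccard similarity over word trigrams (a word-overlap measure). The 0.8 threshold, the trigram size, and the function names are assumptions for illustration. The pairwise comparison is quadratic, so a large corpus would need an approximate method such as MinHash instead.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Set of word n-grams; a short text yields a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs: List[str], threshold: float = 0.8) -> Tuple[List[str], Dict[str, float]]:
    """Return (kept, stats). Exact pass first, then near-duplicate pass."""
    seen_hashes: Set[str] = set()
    unique: List[str] = []
    for d in docs:
        h = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if h not in seen_hashes:
            seen_hashes.add(h)
            unique.append(d)

    kept: List[str] = []
    kept_shingles: List[set] = []
    for d in unique:
        s = shingles(d)
        # Keep a document only if it is not too similar to anything already kept.
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(d)
            kept_shingles.append(s)

    total = len(docs)
    stats = {
        "exact_rate": (total - len(unique)) / total if total else 0.0,
        "near_rate": (len(unique) - len(kept)) / total if total else 0.0,
    }
    return kept, stats


docs = [
    "the quick brown fox jumps over the lazy dog today",
    "the quick brown fox jumps over the lazy dog today",        # exact duplicate
    "the quick brown fox jumps over the lazy dog today again",  # near duplicate
    "an entirely different sentence about dataset curation",
]
kept, stats = deduplicate(docs)
```

The returned `stats` are the rates to write into the curation log.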
Clean
Strip control characters, fix encoding, repair broken structure, and redact personal data. Cleaning is mechanical and rule-based. It does not yet touch quality or coverage.
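A rule-based cleaning pass might look like the sketch below. It covers only part of the step: Unicode normalisation, control-character stripping, whitespace collapsing, and redaction of email addresses as one example of personal data. The regular expression and the `[EMAIL]` placeholder are assumptions. Real redaction needs to cover far more than email addresses, and repairing mis-decoded text is not attempted here.

```python
import re
import unicodedata

# Deliberately simple pattern; an assumption for illustration, not a full email grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def clean(text: str) -> str:
    """Apply mechanical, rule-based cleaning to one document."""
    # Normalise to a single canonical Unicode form.
    text = unicodedata.normalize("NFC", text)
    # Drop control characters (Unicode category Cc), keeping newline and tab.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Redact email addresses.
    text = EMAIL_RE.sub("[EMAIL]", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


print(clean("Hi\x00 there\x07, mail  bob@example.com now\r\n"))
```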
Filter on quality
Apply quality filters. Some are simple rules, such as length, vocabulary, or the share of meaningful words. Others use a model, such as a classifier or an LLM that scores each example. Keep only examples above the cutoff. Write down the cutoff, and check a hand-labelled sample to see how often the filter is wrong in each direction.
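Checking how often the filter is wrong in each direction amounts to computing a false-reject rate and a false-accept rate on the hand-labelled sample. The sketch below uses a toy rule-based score; `rule_score`, the five-word minimum, and the 0.7 cutoff are hypothetical and stand in for whatever rules or model the real filter uses.

```python
from typing import Callable, Dict, List, Tuple


def rule_score(text: str) -> float:
    """Toy score: share of purely alphabetic words, zero for very short texts."""
    words = text.split()
    if len(words) < 5:
        return 0.0
    return sum(w.isalpha() for w in words) / len(words)


def filter_error_rates(
    labelled: List[Tuple[str, bool]],
    score: Callable[[str], float],
    cutoff: float,
) -> Dict[str, float]:
    """labelled holds (text, is_good) pairs judged by a human reader."""
    good = [t for t, ok in labelled if ok]
    bad = [t for t, ok in labelled if not ok]
    false_rejects = sum(score(t) < cutoff for t in good)   # good examples thrown away
    false_accepts = sum(score(t) >= cutoff for t in bad)   # bad examples let through
    return {
        "false_reject_rate": false_rejects / len(good) if good else 0.0,
        "false_accept_rate": false_accepts / len(bad) if bad else 0.0,
    }


labelled = [
    ("This sentence is clean and readable prose", True),
    ("Short but good", True),                      # rejected by the length rule
    ("x9 ## 00 !! zz1 ?? 42", False),
    ("buy now buy now buy now buy now", False),    # passes the rule despite being spam
]
rates = filter_error_rates(labelled, rule_score, cutoff=0.7)
```

In this toy sample the filter is wrong half the time in each direction, which is the kind of result that should send the cutoff or the rule back for revision before the filter runs on the full set.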
Cover the dimensions
Check the filtered set against the coverage map. For any thin dimension, either add data by generating or upsampling it, or note plainly that the model will be weak there. Coverage gaps you do not flag turn into silent failures in production.
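The coverage check can be sketched as counting examples per cell of the coverage map and listing the cells that fall below a minimum. The dimension names, the `min_count` of 2, and the choice to check the full cross product of dimensions are assumptions. A real coverage map may only care about some combinations.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple


def coverage_gaps(
    examples: List[Dict[str, str]],
    coverage_map: Dict[str, List[str]],
    min_count: int,
) -> Tuple[List[str], Dict[Tuple[str, ...], int]]:
    """Return the dimension order and {cell: count} for every thin cell."""
    dims = sorted(coverage_map)  # fixed order so cells are comparable
    counts = Counter(tuple(ex.get(d) for d in dims) for ex in examples)
    gaps: Dict[Tuple[str, ...], int] = {}
    for cell in product(*(coverage_map[d] for d in dims)):
        if counts[cell] < min_count:
            gaps[cell] = counts[cell]
    return dims, gaps


# Hypothetical coverage map and tagged examples.
coverage_map = {"language": ["en", "de"], "format": ["email", "chat"]}
examples = [
    {"language": "en", "format": "email"},
    {"language": "en", "format": "email"},
    {"language": "en", "format": "chat"},
    {"language": "en", "format": "chat"},
    {"language": "de", "format": "email"},
]
dims, gaps = coverage_gaps(examples, coverage_map, min_count=2)
for cell, n in gaps.items():
    print(dict(zip(dims, cell)), "has only", n, "examples")
```

Each reported cell is either a target for generation or upsampling, or a documented weakness that should go into the curation log.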
Format and version
Convert to the training format you need, such as JSONL of messages, parquet, or tokenised shards. Compute the dataset statistics. Tag the version. Register it in the dataset store with a full trail. Keep the raw and intermediate stages so you can re-run the pipeline.
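As a minimal sketch of the format-and-version step, the function below serialises records to JSONL, computes a couple of statistics, and derives a version tag from a hash of the content. The content-hash tagging scheme, the `build_artifact` name, and the chosen statistics are assumptions. Registration in a dataset store is left out because it depends on the store in use.

```python
import hashlib
import json
from typing import Dict, List


def build_artifact(records: List[Dict], schema: List[str]) -> Dict:
    """Serialise records to JSONL and attach stats plus a content-derived version tag."""
    lines = [
        json.dumps({k: r[k] for k in schema}, sort_keys=True, ensure_ascii=False)
        for r in records
    ]
    payload = "\n".join(lines) + "\n"
    # Identical content always yields the same tag; any change yields a new one.
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {
        "jsonl": payload,
        "version": f"sha256-{digest[:12]}",
        "schema": schema,
        "stats": {
            "num_records": len(records),
            "total_chars": sum(len(str(r[k])) for r in records for k in schema),
        },
    }


# Hypothetical curated records.
records = [
    {"id": "a1", "text": "hello"},
    {"id": "a2", "text": "worlds"},
]
artifact = build_artifact(records, schema=["id", "text"])
print(artifact["version"], artifact["stats"])
```

A content-derived tag makes re-runs checkable: running the pipeline again over the preserved raw data with the same rules should reproduce the same tag.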
Principles
- Read the data by hand. Every dataset starts with eyes on raw examples.
- Three goals, not two. Quality and quantity are easy to remember; coverage is the one people forget.
- Preserve the raw data. You can only reverse curation if the original survives.
- Keep the curation log. A dataset whose curation rules are lost cannot be fixed.
Known failure modes (3)
- Automating a Broken Process
Inheriting a curation pipeline from a previous project and running it blind. The new corpus needs different filters, and the old ones silently throw away the wrong examples.
- Errors Swept Under the Rug
Filtering on quality without auditing the rejected examples. The filter throws away minority-class cases and the model learns the bias.
- Reward Hacking
Setting an LLM-judge quality filter that the data generator can game. The result is high scores and low real quality.
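The second failure mode above can be caught early by comparing rejection rates across classes. The sketch below is a minimal audit under stated assumptions: the `label` field, the word-count filter, and the function name are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def rejection_rate_by_group(
    examples: List[Dict[str, str]],
    keep: Callable[[Dict[str, str]], bool],
    group_key: str,
) -> Dict[str, float]:
    """Share of examples the filter rejects, computed separately per group."""
    totals: Dict[str, int] = defaultdict(int)
    rejected: Dict[str, int] = defaultdict(int)
    for ex in examples:
        group = ex[group_key]
        totals[group] += 1
        if not keep(ex):
            rejected[group] += 1
    return {g: rejected[g] / totals[g] for g in totals}


def keep_long(ex: Dict[str, str]) -> bool:
    """Toy quality filter: keep texts of at least four words."""
    return len(ex["text"].split()) >= 4


examples = [
    {"label": "common", "text": "a long and ordinary example sentence"},
    {"label": "common", "text": "another long and ordinary sentence"},
    {"label": "common", "text": "yet another perfectly ordinary sentence"},
    {"label": "common", "text": "too short"},
    {"label": "rare", "text": "terse edge case"},
    {"label": "rare", "text": "rare input"},
]
rates = rejection_rate_by_group(examples, keep_long, group_key="label")
```

Here the filter rejects a quarter of the common class and all of the rare class, a gap that only shows up when the rejected examples are broken down by group.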
Related patterns (4)
- ★ Dimensional Synthetic Eval Set
Generate evaluation inputs not by free-form LLM prompting (which mode-collapses) but by enumerating tuples over explicitly named dimensions and seeding generation from each tuple.
- ★ Streaming Feature Pipeline
Process raw documents into RAG features as a continuous stream rather than a batch job, with typed models pinning each stage.
- ★★ LLM-as-Judge
Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
- ★ Frozen Rubric Reflection
Constrain reflection to a fixed, hand-authored rubric of criteria so the reviewer cannot invent new ones each run.
Related compositions (2)
- recipe · abstract shape: Production LLM Platform
Stand up a production LLM/RAG system whose data pipeline, model pipeline, and inference path scale and deploy independently.
- recipe · abstract shape: Eval & Observability
How you keep an agent honest in production: harness, judge, decision log, provenance, shadow rollouts.
Related methodologies (2)
- Instruct Dataset Generation Pipeline ★
Turn a raw document corpus into a clean, leak-free, well-covered instruction-tuning dataset through seven clear stages.
- Evaluation-Driven Development ★★
Judge every prompt change, model swap, search tweak, and new tool against a test you committed to up front, not by feel.
Provenance
- Added to catalog:
- Last updated:
- Verification status: verified