Methodology · Data Engineering

Dataset Curation Pipeline

Turn raw data into a versioned training dataset. Run it through an inspect, deduplicate, clean, filter, and format pipeline that openly trades off quality, coverage, and quantity.

Description

Build any ML or LLM training dataset around three goals: quality, coverage, and quantity. Run the data through an explicit pipeline that inspects, deduplicates, cleans, filters, checks coverage, and then formats it. The pipeline outputs a versioned dataset, leaves the original raw data unchanged, and records every step so the result can be audited. The guiding principle is that reading the data by hand delivers more value for less glory than almost anything else in machine learning. This pipeline forces that reading to happen.
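One way to picture the "versioned, raw data unchanged, every step recorded" requirement is a small runner that applies named steps and writes a manifest. This is a minimal sketch, not a prescribed implementation: the name `run_pipeline`, the manifest fields, and the choice to derive the version from a SHA-256 hash of the output are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Callable

Record = dict
Step = Callable[[list[Record]], list[Record]]


def run_pipeline(raw: list[Record], steps: list[tuple[str, Step]]) -> tuple[list[Record], dict]:
    """Apply steps in order; return the curated records and an audit manifest.

    Hypothetical helper: each step is expected to return a new list
    rather than mutate its input, so the raw records stay untouched.
    """
    data = list(raw)  # work on a copy of the list; the caller's raw list is never reassigned
    log = []
    for name, step in steps:
        before = len(data)
        data = step(data)
        # Record how many examples went in and came out of each step.
        log.append({"step": name, "in": before, "out": len(data)})

    # Assumption: the version id is a content hash, so identical output
    # always gets the same version.
    payload = json.dumps(data, sort_keys=True, ensure_ascii=False).encode("utf-8")
    manifest = {
        "version": hashlib.sha256(payload).hexdigest()[:12],
        "raw_count": len(raw),
        "final_count": len(data),
        "steps": log,
    }
    return data, manifest


if __name__ == "__main__":
    raw = [{"text": "a"}, {"text": ""}, {"text": "b"}]
    drop_empty = lambda rs: [r for r in rs if r["text"]]
    curated, manifest = run_pipeline(raw, [("drop_empty", drop_empty)])
    print(json.dumps(manifest, indent=2))
```

Storing the manifest next to the dataset file gives a reviewer the per-step counts without rerunning anything. A content hash is only one option; a dated tag or a data-versioning tool would serve the same purpose.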

When to apply

Use this when preparing any real training, fine-tuning, or evaluation dataset. That covers a pre-training corpus, a fine-tuning instruction set, a preference set, or a RAG evaluation set. Run it before training, not after. Don't apply it to one-off prototypes whose dataset you will throw away, because the audit trail is wasted there. One exception: if the dataset is small enough for one engineer to read every example, you can skip the formal pipeline, but still record that the inspection happened.

What it involves

Inspect raw data manually
Deduplicate
Clean
Filter on quality
Cover the dimensions
Format and version
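
The six steps above can be sketched as small functions over a list of records. Everything specific here is an assumption for illustration: the `text` and `topic` field names, exact-match deduplication on normalised text, the `min_words` threshold as a stand-in quality filter, and JSONL as the output format. Real pipelines often use near-duplicate detection and stronger quality signals.

```python
import hashlib
import json
import random
import re
from collections import Counter


def sample_for_inspection(records, k=20, seed=0):
    """Inspect: draw a reproducible random sample for a human to read."""
    return random.Random(seed).sample(records, min(k, len(records)))


def deduplicate(records):
    """Deduplicate: keep the first record for each normalised text."""
    seen, kept = set(), []
    for r in records:
        # Normalise whitespace and case only for the comparison key.
        norm = " ".join(r["text"].split()).casefold()
        key = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept


def clean(records):
    """Clean: collapse whitespace runs and trim; returns new dicts."""
    return [{**r, "text": re.sub(r"\s+", " ", r["text"]).strip()} for r in records]


def filter_quality(records, min_words=3):
    """Filter: drop records shorter than min_words (a placeholder heuristic)."""
    return [r for r in records if len(r["text"].split()) >= min_words]


def coverage(records, dimension="topic"):
    """Cover: count records per value of one dimension to expose gaps."""
    return Counter(r.get(dimension, "unknown") for r in records)


def to_jsonl(records):
    """Format: serialise one JSON object per line with stable key order."""
    return "\n".join(json.dumps(r, sort_keys=True, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    raw = [
        {"text": "How do I  reset my password?", "topic": "account"},
        {"text": "how do i reset my password?", "topic": "account"},
        {"text": "ok", "topic": "billing"},
        {"text": "  Where can I download my invoice? ", "topic": "billing"},
    ]
    print(sample_for_inspection(raw, k=2))
    curated = filter_quality(clean(deduplicate(raw)))
    print(coverage(curated))
    print(to_jsonl(curated))
```

Each function returns a new list, so the raw records are not modified. The coverage count is only a report: deciding which dimensions matter, and what to do about a thin one, remains a human judgement.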
