Dataset Curation Pipeline
Turn raw data into a versioned training dataset. Run it through an inspect, deduplicate, clean, filter, and format pipeline that openly trades off quality, coverage, and quantity.
Description
Build any ML or LLM training dataset around three goals: quality, coverage, and quantity. Run the data through an explicit pipeline that inspects, deduplicates, cleans, filters, and then formats it. The pipeline outputs a versioned dataset, leaves the original raw data unchanged, and records every step so the result can be audited. The guiding principle is that reading the data by hand delivers more value, for less glory, than almost anything else in machine learning. This pipeline forces that reading to happen.
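The versioning and audit requirements can be made concrete with a small sketch. It assumes records are JSON-serializable dicts; the names `dataset_version` and `run_step`, the 12-character version id, and the log fields are illustrative choices, not part of the pattern itself.

```python
import hashlib
import json


def dataset_version(records):
    """Derive a short version id from the content of the records."""
    digest = hashlib.sha256()
    for record in records:
        # sort_keys keeps the serialization stable across dict key orderings
        digest.update(json.dumps(record, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()[:12]


def run_step(name, step, records, audit_log):
    """Apply one pipeline step and append an audit entry describing it."""
    # The step receives a shallow copy, so it cannot reorder or shrink
    # the caller's list. Steps should still return new dicts, not edit them.
    result = step(list(records))
    audit_log.append({
        "step": name,
        "rows_in": len(records),
        "rows_out": len(result),
        "version_out": dataset_version(result),
    })
    return result


# Hypothetical toy data, for illustration only.
raw = [{"text": "hello"}, {"text": ""}, {"text": "world"}]
audit_log = []
non_empty = run_step(
    "drop_empty",
    lambda rs: [r for r in rs if r["text"]],
    raw,
    audit_log,
)
```

After the call, `raw` still holds all three records, and `audit_log` holds one entry with the row counts and the version id of the step's output. Hashing content rather than counting runs means two identical outputs get the same version id, which is one reasonable design choice among several.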
When to apply
Use this when preparing any real training, fine-tuning, or evaluation dataset. That covers a pre-training corpus, a fine-tuning instruction set, a preference set, or a RAG evaluation set. Run it before training, not after. Don't apply it to one-off prototypes whose dataset you will throw away, because the audit trail is wasted effort there. One exception: if the dataset is small enough for one engineer to read every example, you can skip the formal pipeline, but still record that the inspection happened.
What it involves
- Inspect raw data manually
- Deduplicate
- Clean
- Filter on quality
- Cover the dimensions
- Format and version
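The steps above can be sketched as a chain of small functions. This is a minimal illustration under stated assumptions, not a reference implementation: records are dicts with a `text` field and a hypothetical `topic` field, deduplication is exact-match on normalized text (near-duplicate detection is out of scope here), and the quality filter is a simple word-count threshold standing in for whatever quality signal the dataset actually needs.

```python
import hashlib
import json
import random
import re
from collections import Counter


def sample_for_review(records, k=5, seed=0):
    """Inspect: draw a reproducible random sample for a human to read."""
    return random.Random(seed).sample(records, k=min(k, len(records)))


def normalize(text):
    """Collapse whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(records):
    """Deduplicate: keep the first record per normalized, lowercased text."""
    seen, unique = set(), []
    for record in records:
        text = normalize(record["text"]).lower()
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def clean(records):
    """Clean: return new dicts with normalized text; inputs stay untouched."""
    return [{**r, "text": normalize(r["text"])} for r in records]


def quality_filter(records, min_words=3):
    """Filter: drop records shorter than min_words (a stand-in heuristic)."""
    return [r for r in records if len(r["text"].split()) >= min_words]


def coverage(records, dimension):
    """Cover: count records per value of one dimension to expose gaps."""
    return Counter(r.get(dimension, "unknown") for r in records)


def to_jsonl(records):
    """Format: one JSON object per line, with stable key order."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


# Hypothetical toy data, for illustration only.
raw = [
    {"text": "How do I  reset my password?", "topic": "account"},
    {"text": "how do i reset my password?", "topic": "account"},
    {"text": "ok", "topic": "chitchat"},
    {"text": "What is the refund policy for annual plans?", "topic": "billing"},
]

to_read = sample_for_review(raw, k=2)
curated = quality_filter(clean(deduplicate(raw)))
topic_counts = coverage(curated, "topic")
formatted = to_jsonl(curated)
```

On this toy input the second record is dropped as a duplicate and the third is dropped by the length filter, which also removes the only `chitchat` example. Checking `topic_counts` after filtering is what surfaces that kind of coverage loss, so the quality threshold can be weighed against it.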