Pretrain Then Adapt
also known as pretrain-and-fine-tune, base-then-specialize
Build one general model first, then specialise it for each task. To build the base, train it to predict the next word over a large pile of plain, unlabeled text. To specialise it, train that base further on a smaller set of labeled task data (this second step is supervised fine-tuning). The point is that one base can feed many specialised versions, such as a classifier, an instruction follower, or a domain assistant, without paying the heavy base-training cost again. The approach splits the work into two parts. The expensive part, pretraining, teaches the model general knowledge. The cheap part, adaptation, shapes how the model behaves. Because adaptation costs little, you can run it quickly and repeat it once per use case.
Methodology process overview
Intent. Pay the cost of learning general language once, then spread it across many tasks by training one base and adapting it cheaply for each.
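The saving can be made concrete with a little arithmetic. The sketch below compares total cost with and without a shared base. The function name and the cost units are hypothetical, chosen only to show the shape of the saving, not to reflect real training budgets.

```python
def total_cost(base_cost: float, adapt_cost: float, num_tasks: int, share_base: bool) -> float:
    """Total training cost for num_tasks specialised models."""
    if share_base:
        # Pay for the base once, then one cheap adaptation per task.
        return base_cost + adapt_cost * num_tasks
    # Without a shared base, every task pays the full base cost again.
    return (base_cost + adapt_cost) * num_tasks


# Hypothetical cost units (assumption, for illustration only).
BASE_COST = 1000.0
ADAPT_COST = 10.0

for n in (1, 3, 10):
    shared = total_cost(BASE_COST, ADAPT_COST, n, share_base=True)
    separate = total_cost(BASE_COST, ADAPT_COST, n, share_base=False)
    print(f"tasks={n:2d}  shared={shared:7.0f}  separate={separate:7.0f}")
```

With a single task the two totals are equal. The gap opens only as more tasks reuse the same base.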
When to apply. Use this when you control the model weights and a generic instruction-tuned model is not good enough for your domain or task family. You also need either an existing base checkpoint or enough unlabeled domain text. Skip it when a hosted instruction-tuned API model already clears your quality bar, because then the adaptation cost is not worth it. Two exceptions to that skip rule: regulated settings that require the weights to stay on your own servers, and research into how base models themselves behave.
Inputs
- Unlabeled pretraining corpus — A large body of domain or general text. You use it to train the base model, or keep training an existing one, by having it predict the next word.
- Base model checkpoint — Either a fresh model to train from scratch, or an existing open-weight base you can keep training.
- Task-specific labeled data — A smaller, curated dataset for the downstream task. For example, classification labels, instruction-and-response pairs, or domain Q&A.
Outputs
- Pretrained base model — A general-purpose checkpoint that can predict the next word. You can reuse it across many downstream tasks.
- Fine-tuned variant(s) — One or more task-specialised models, such as a classifier, an instruction follower, or a domain chatbot. Each one branches off the same base.
Steps (5)
Assemble and clean the pretraining corpus
Gather plain text that reflects both your target domain and general language. Remove duplicates. Filter out low-quality and toxic content. Then split it into tokens using a vocabulary that suits the model.
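A minimal sketch of this cleaning step, using only the standard library. The blocklist, the length threshold, and every function name are assumptions made for illustration. The word-level vocabulary is a stand-in too: a real pipeline would more likely use a subword tokenizer such as byte-pair encoding, and much richer quality and toxicity filters.

```python
import hashlib
import re

# Hypothetical filter settings (assumptions, not recommended values).
BLOCKLIST = {"spamword"}
MIN_WORDS = 4


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return re.sub(r"\s+", " ", text.strip().lower())


def clean_corpus(docs: list[str]) -> list[str]:
    """Drop duplicates (after normalisation), very short docs, and blocklisted docs."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        norm = normalize(doc)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of a document seen earlier
        seen.add(digest)
        words = norm.split()
        if len(words) < MIN_WORDS:
            continue  # too short to be useful training text
        if any(word in BLOCKLIST for word in words):
            continue  # crude stand-in for a toxicity or quality filter
        kept.append(doc.strip())
    return kept


def build_vocab(docs: list[str]) -> dict[str, int]:
    """Word-level vocabulary; id 0 is reserved for unknown words."""
    vocab = {"<unk>": 0}
    for doc in docs:
        for word in normalize(doc).split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Map text to token ids, falling back to <unk> for unseen words."""
    return [vocab.get(word, 0) for word in normalize(text).split()]


raw = [
    "The cat sat on the mat.",
    "the  cat sat on the mat.",        # duplicate after normalisation
    "too short",                       # fails the length filter
    "buy spamword now please friend",  # fails the blocklist filter
]
corpus = clean_corpus(raw)
vocab = build_vocab(corpus)
print(corpus, encode("the cat flew", vocab))
```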
Pretrain the base with next-token prediction
Train the model on the text by having it predict the next token (this is the causal language-modeling objective). Track the loss on held-out text, save checkpoints regularly, and stop when that held-out loss flattens out.
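The three mechanics in this step can be sketched without any framework: building input and target windows where the target is the input shifted by one token, scoring one prediction with cross-entropy, and deciding when held-out loss has flattened. The function names, the `patience` and `min_delta` parameters, and their default values are assumptions for illustration. A real training loop would compute the loss over batches inside a deep-learning framework.

```python
import math


def next_token_pairs(token_ids: list[int], context_len: int) -> list[tuple[list[int], list[int]]]:
    """Slice a token stream into (input, target) windows; targets are inputs shifted by one."""
    pairs = []
    for start in range(0, len(token_ids) - context_len, context_len):
        inputs = token_ids[start : start + context_len]
        targets = token_ids[start + 1 : start + context_len + 1]
        pairs.append((inputs, targets))
    return pairs


def cross_entropy(logits: list[float], target: int) -> float:
    """Negative log-probability of the target token under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]


def has_plateaued(val_losses: list[float], patience: int = 3, min_delta: float = 0.01) -> bool:
    """True when the last `patience` evals did not beat the earlier best by min_delta."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return best_before - recent_best < min_delta


print(next_token_pairs([1, 2, 3, 4, 5, 6, 7], context_len=3))
print(has_plateaued([3.0, 2.5, 2.2, 2.199, 2.198, 2.2]))
```

Each target window is the input window moved one position to the right, so every position in the input has a next token to predict.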
Branch to specialization tasks
Freeze the base checkpoint, or at least version it, so that later training never overwrites it. For each task, start from a copy of the base, attach the appropriate output head, and keep training on that task's data. For label prediction, attach a classification head. For instruction following, keep the language-modeling head.
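The bookkeeping behind branching can be sketched with plain data classes. This is a toy stand-in, not a training implementation: the weights are short lists of floats, and `BaseCheckpoint`, `Variant`, and `branch` are hypothetical names. The sketch shows that each variant starts from a copy of the base, that a classification task gets a fresh head while an instruction task keeps the language-modeling head, and that the base itself stays unchanged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BaseCheckpoint:
    """Immutable record of a pretrained base (toy stand-in for real weights)."""
    version: str
    body: tuple[float, ...]     # shared model-body weights
    lm_head: tuple[float, ...]  # next-token output head


@dataclass
class Variant:
    """A task-specific model that branched off a named base version."""
    task: str
    base_version: str
    body: list[float]
    head_kind: str
    head: list[float]


def branch(base: BaseCheckpoint, task: str, num_labels: Optional[int] = None) -> Variant:
    """Start a task variant from a copy of the base; the base is never mutated."""
    if num_labels is not None:
        # Label prediction: attach a fresh classification head, one output per label.
        head_kind, head = "classification", [0.0] * num_labels
    else:
        # Instruction following: keep the language-modeling head from the base.
        head_kind, head = "lm", list(base.lm_head)
    return Variant(task, base.version, list(base.body), head_kind, head)


base = BaseCheckpoint("base-v1", (0.1, 0.2, 0.3), (0.5, 0.6))
spam = branch(base, "spam-classifier", num_labels=2)
chat = branch(base, "instruction-follower")

spam.body[0] = 9.9  # fine-tuning updates the variant's copy only
print(base.body, spam.head_kind, chat.head_kind)
```

Recording `base_version` on every variant makes it possible to trace which base a deployed model came from.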
Validate each variant against a task-specific eval
Give each specialised version its own test set, with metrics that fit the task, such as accuracy, F1, or an instruction-following score. How well the base predicts text does not tell you how good a variant is at its task.
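For a classifier variant, the task metrics named above can be computed as in the following sketch. The function names and the sample labels are made up for illustration. An instruction-following score would need a different scorer altogether, which is not shown here.

```python
def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: list[int], y_pred: list[int], positive: int = 1) -> float:
    """Binary F1 for the `positive` class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives, so precision and recall are both zero
    return 2 * tp / (2 * tp + fp + fn)


# Made-up labels for a held-out test set of a binary classifier variant.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(f"accuracy={accuracy(y_true, y_pred):.3f}  f1={f1_score(y_true, y_pred):.3f}")
```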
Re-use the base for the next task
When a new task arrives, go back to the frozen base instead of training from scratch. Reusing the base is the whole reason this approach saves money.
Principles
- Keep the two jobs apart. Learning general knowledge is one job. Shaping behaviour is another. Each can be tuned on its own.
- The base is a reusable asset, not a throwaway middle step. Version it and keep it.
- How well the base predicts text is not a task score. Every specialised version needs its own task test.
- Specialising is only cheap when the base is good. Invest in the base, then iterate fast on the task-specific parts.
Known failure modes (2)
Related patterns (1)
Related methodologies (2)
- Instruction Fine-tune Then Judge Cycle★★
Iterate on instruction fine-tunes using one signal, a model-graded score on the test set, while keeping training fit and answer quality as separate readings.
- SFT Then DPO Fine-tuning Workflow★
Take an open-weight base to a production-ready, well-behaved assistant in two clear stages, each with its own data and goal, sharing one training pipeline.
Sources (2)
Build a Large Language Model (From Scratch) — Sebastian Raschka
Ch 5 'Pretraining on Unlabeled Data' → Ch 6 'Finetuning for Text Classification' → Ch 7 'Fine-tuning to follow instructions' “Build a Large Language Model (from Scratch) takes you inside the AI black box to tinker with the internal systems that power generative AI.”
rasbt/LLMs-from-scratch — official companion repository (ch05/ch06/ch07 directories)
“Chapter 5: Pretraining on Unlabeled Data ... Chapter 6: Finetuning for Text Classification ... Chapter 7: Finetuning to Follow Instructions”
Provenance
- Added to catalog:
- Last updated:
- Verification status: verified