Methodology · Fine-Tuning · proven · verified

Pretrain Then Adapt

also known as pretrain-and-fine-tune, base-then-specialize

Applies to: llm, llm-app

Tags: pretraining, fine-tuning, transfer-learning, base-model

Build one general model first, then specialise it for each task. To build the base, train it to predict the next word over a large body of plain, unlabeled text. To specialise it, train that base further on a smaller set of labeled task data (this second step is supervised fine-tuning). One base can then feed many specialised versions, such as a classifier, an instruction follower, or a domain assistant, without paying the heavy base-training cost again. The approach splits the work into two parts: the expensive part teaches the model general knowledge, and the cheap part shapes how the model behaves. Because adaptation is cheap, you can run it quickly, once per use case.

Methodology process overview

flowchart TD
    corpus[Unlabeled corpus] --> clean[Clean + dedupe + tokenize]
    clean --> pre[Pretrain: next-token prediction]
    pre --> base[Pretrained base checkpoint]
    base --> v1[Variant: classifier head]
    base --> v2[Variant: instruction follower]
    base --> v3[Variant: domain assistant]
    data1[Labeled classification data] --> v1
    data2[Instruction/response pairs] --> v2
    data3[Domain Q&A] --> v3
    v1 --> eval1[Classification eval]
    v2 --> eval2[Instruction-following eval]
    v3 --> eval3[Domain eval]
    base -->|new task arrives| reuse[Branch again, do not retrain]

Intent. Pay the cost of learning general language once, then spread it across many tasks by training one base and adapting it cheaply for each.
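The amortisation argument can be made concrete with a little arithmetic. The sketch below compares training a separate model from scratch for every task against pretraining once and adapting per task. The function names and every cost figure are hypothetical, chosen only to show the shape of the trade-off; none of them are measurements from the source.

```python
def cost_from_scratch(n_tasks: int, base_cost: float, adapt_cost: float) -> float:
    """Every task pays for its own general-language training plus adaptation."""
    return n_tasks * (base_cost + adapt_cost)


def cost_pretrain_then_adapt(n_tasks: int, base_cost: float, adapt_cost: float) -> float:
    """The base is paid for once; each task only adds its adaptation cost."""
    return base_cost + n_tasks * adapt_cost


# Hypothetical figures in arbitrary compute units (assumptions, not data).
BASE_COST = 1000.0
ADAPT_COST = 10.0

for n in (1, 3, 4):
    scratch = cost_from_scratch(n, BASE_COST, ADAPT_COST)
    shared = cost_pretrain_then_adapt(n, BASE_COST, ADAPT_COST)
    print(f"{n} task(s): from scratch={scratch:.0f}, shared base={shared:.0f}")
```

With a single task the two strategies cost the same. The saving only appears from the second task onward, which is why the approach suits task families rather than one-off models.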

When to apply. Use this when you control the model weights and a generic instruction-tuned model is not good enough for your domain or task family. You also need either an existing base checkpoint or enough unlabeled domain text to pretrain on. Skip it when a hosted instruction-tuned API model already clears your quality bar, because then the adaptation cost is not worth it. Two cases justify it even then: regulated settings that force the weights to stay on your own servers, and research into how base models themselves behave.

Example scenario

A team at a mid-sized enterprise has decided it needs on-prem language models for three internal tasks: sorting inbound support emails into 14 categories, an instruction-following assistant for internal Q&A, and a summariser for legal contracts. They follow the pretrain-then-adapt approach from Raschka's 'Build a Large Language Model (From Scratch)'. Rather than train three models from scratch, or pay for three separate API fine-tunes, they build a base once (or adopt an open-weight one) and specialise it three times.

Stage one: they assemble a corpus that blends public web text with the firm's internal documents, dedupe and clean it, and pretrain a small base on next-token prediction. When the held-out loss flattens, they checkpoint the base.

Stage two branches three times. For the classifier, they attach a classification head and train on 8,000 labelled emails. For the instruction follower, they keep the language-modeling head and fine-tune on 935 (instruction, response) pairs. For the contract summariser, they fine-tune on a small curated set of contract/summary pairs. Each variant gets its own task-specific test; the pretraining loss is no longer used as a quality signal.

When a fourth task arrives later, entity extraction from invoices, the team returns to the frozen base rather than pretraining again. That reusability is the economic case for the approach.

Inputs

Unlabeled pretraining corpus — A large body of domain or general text. You use it to train the base model, or keep training an existing one, by having it predict the next word.
Base model checkpoint — Either a fresh model to train from scratch, or an existing open-weight base you can keep training.
Task-specific labeled data — A smaller, curated dataset for the downstream task. For example, classification labels, instruction-and-response pairs, or domain Q&A.

Outputs

Pretrained base model — A general-purpose checkpoint that can predict the next word. You can reuse it across many downstream tasks.
Fine-tuned variant(s) — One or more task-specialised models, such as a classifier, an instruction follower, or a domain chatbot. Each one branches off the same base.

Steps (5)

Assemble and clean the pretraining corpus
Gather plain text that reflects both your target domain and general language. Remove duplicates. Filter out low-quality and toxic content. Then split it into tokens using a vocabulary that suits the model.
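As a rough illustration of this step, the sketch below removes exact duplicates, drops very short documents, and tokenizes with a toy regex. All names (`clean_and_dedupe`, `tokenize`, `build_vocab`) and the `MIN_WORDS` threshold are hypothetical. A production pipeline would typically add near-duplicate detection, a real quality and toxicity filter, and a subword tokenizer, none of which are shown here.

```python
import hashlib
import re

MIN_WORDS = 5  # hypothetical quality threshold; tune for your corpus


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_and_dedupe(docs: list[str], min_words: int = MIN_WORDS) -> list[str]:
    """Drop exact duplicates (after normalisation) and very short documents."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # crude stand-in for a real low-quality filter
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # an identical document was already kept
        seen.add(digest)
        kept.append(doc.strip())
    return kept


def tokenize(text: str) -> list[str]:
    """Toy tokenizer: words and single punctuation marks become tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(docs: list[str]) -> dict[str, int]:
    """Map each distinct token to an integer id, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}
```

Hashing the normalised text keeps memory use flat as the corpus grows, at the cost of missing documents that differ by even one character.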
Pretrain the base with next-token prediction
Train the model on the text by having it predict the next token (this is the causal language-modeling objective). Watch how well it does on held-out text, save checkpoints regularly, and stop when that held-out loss flattens out.
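Two pieces of this step are easy to show without a training framework: how next-token targets are derived from the input, and one possible way to decide that the held-out loss has flattened. The sketch below is framework-free and illustrative only. The function names, `patience`, and `min_delta` are assumptions, and the plateau rule is one common heuristic rather than a method the source prescribes.

```python
def next_token_pairs(
    token_ids: list[int], context_len: int
) -> list[tuple[list[int], list[int]]]:
    """Slide a window over the ids; targets are the inputs shifted by one."""
    pairs = []
    for start in range(0, len(token_ids) - context_len):
        inputs = token_ids[start : start + context_len]
        targets = token_ids[start + 1 : start + context_len + 1]
        pairs.append((inputs, targets))
    return pairs


def has_plateaued(
    val_losses: list[float], patience: int = 3, min_delta: float = 0.01
) -> bool:
    """True when the last `patience` evaluations did not beat the earlier
    best held-out loss by at least `min_delta`."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta


# Each target sequence is the input sequence moved one position ahead.
for inputs, targets in next_token_pairs([10, 11, 12, 13, 14], context_len=3):
    print(inputs, "->", targets)
```

In a real training loop, `has_plateaued` would be called after each held-out evaluation, with a checkpoint saved before the check so the best weights are never lost.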
Branch to specialization tasks
Freeze the base checkpoint, or at least version it. For each task, attach the appropriate output head and continue training on that task's data. For label prediction, attach a classification head. For instruction following, keep the language-modeling head.
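The branching rule can be sketched structurally, without a deep-learning library. In the sketch below, `BaseCheckpoint`, `Variant`, and `branch` are hypothetical names, the `weights` tuple is a placeholder for real parameters, and the hidden and vocabulary sizes are illustrative values only. The point it shows is that each variant receives its own copy of the weights and a head sized for its task, while the versioned base stays immutable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BaseCheckpoint:
    """A versioned, immutable base; `weights` stands in for real parameters."""
    version: str
    hidden_size: int
    vocab_size: int
    weights: tuple


@dataclass
class Variant:
    """A task-specific model: a private copy of the base weights plus a head."""
    task: str
    base_version: str
    head_kind: str      # "lm" or "classification"
    head_shape: tuple   # (hidden_size, output_size)
    weights: list


def branch(base: BaseCheckpoint, task: str, num_classes: Optional[int] = None) -> Variant:
    """Create a variant from the base without mutating the base itself."""
    if num_classes is None:
        head_kind, out_features = "lm", base.vocab_size  # keep the LM head
    else:
        head_kind, out_features = "classification", num_classes
    return Variant(
        task=task,
        base_version=base.version,
        head_kind=head_kind,
        head_shape=(base.hidden_size, out_features),
        weights=list(base.weights),  # copy, so fine-tuning never touches the base
    )


# Illustrative sizes; 14 classes matches the email-sorting scenario.
base = BaseCheckpoint("base-v1", hidden_size=768, vocab_size=50257, weights=(0.1, 0.2, 0.3))
classifier = branch(base, "email-triage", num_classes=14)
assistant = branch(base, "internal-qa")
classifier.weights[0] = 9.9  # simulated fine-tuning update on the variant only
print(classifier.head_shape, assistant.head_shape, base.weights[0])
```

Recording `base_version` on every variant makes it possible to trace which base a deployed model came from when a later task branches from the same checkpoint.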
Validate each variant against a task-specific eval
Give each specialised version its own test set, with metrics that fit the task, such as accuracy, F1, or an instruction-following score. How well the base predicts text does not tell you how good a variant is at its task.
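For a classification variant, task metrics of this kind can be computed directly from gold and predicted labels. The following is a minimal pure-Python sketch of accuracy and macro-averaged F1. The function names and the sample labels are made up for illustration, and an established metrics library would normally be used instead.

```python
def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels for illustration only.
gold = ["billing", "billing", "login", "login"]
pred = ["billing", "login", "login", "login"]
print(f"accuracy={accuracy(gold, pred):.2f}, macro_f1={macro_f1(gold, pred):.3f}")
```

Macro averaging is a deliberate choice here: with many categories of uneven size, plain accuracy can look healthy while small categories are mostly misclassified.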
Re-use the base for the next task
When a new task arrives, go back to the frozen base instead of training from scratch. Reusing the base is the whole reason this approach saves money.

Principles

Keep the two jobs apart. Learning general knowledge is one job. Shaping behaviour is another. Each can be tuned on its own.
The base is a reusable asset, not a throwaway middle step. Version it and keep it.
How well the base predicts text is not a task score. Every specialised version needs its own task test.
Specialising is only cheap when the base is good. Invest in the base, then iterate fast on the task-specific parts.

Known failure modes (2)

Related patterns (1)

★★ Augmented LLM
Build the foundational agent block as an LLM augmented with retrieval, tools, and memory that the model actively chooses to use, rather than a bare-model call.

Related methodologies (2)

Sources (2)

Provenance

Added to catalog: 2026-05-24
Last updated: 2026-05-27
Verification status: verified
