Methodology · Fine-Tuning · proven · verified

Pretrain Then Adapt

also known as pretrain-and-fine-tune, base-then-specialize

Applies to: llm, llm-app

Tags: pretraining, fine-tuning, transfer-learning, base-model

Build one general model first, then specialise it for each task. To build the base, train it to predict the next word over a large body of plain, unlabeled text. To specialise it, train that base further on a smaller set of labeled task data (this second step is supervised fine-tuning). One base can then feed many specialised versions, such as a classifier, an instruction follower, or a domain assistant, without paying the heavy base-training cost again. The approach splits the work into two parts: the expensive part teaches the model general knowledge, and the cheap part shapes how the model behaves. Because adaptation is cheap, you can run it quickly and repeat it for each use case.
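To make the next-word objective concrete, here is a minimal sketch that uses a toy bigram counter in place of a neural network. The functions `train_bigram` and `next_token_loss`, the add-one smoothing, and the sample sentence are all illustrative assumptions. A real base model would minimise the same kind of average negative log-likelihood with gradient descent.

```python
import math
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_loss(counts, tokens, vocab_size, alpha=1.0):
    """Average negative log-likelihood of each next token (add-alpha smoothed)."""
    total = 0.0
    n_pairs = 0
    for prev, nxt in zip(tokens, tokens[1:]):
        row = counts.get(prev, Counter())
        prob = (row[nxt] + alpha) / (sum(row.values()) + alpha * vocab_size)
        total -= math.log(prob)
        n_pairs += 1
    return total / n_pairs


# Toy "corpus": whitespace tokens stand in for a real subword tokenizer.
text = "the cat sat on the mat the cat ate".split()
vocab = set(text)

model = train_bigram(text)
model_loss = next_token_loss(model, text, len(vocab))
uniform_loss = math.log(len(vocab))  # loss of guessing every token equally

print(f"model loss {model_loss:.3f} vs uniform {uniform_loss:.3f}")
```

The trained counter scores below the uniform-guess loss on this text. A falling next-token loss shows the same thing for a real base model: it has picked up regularities from unlabeled text.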

Intent. Pay the cost of learning general language once, then spread it across many tasks by training one base and adapting it cheaply for each.
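The arithmetic behind this intent can be sketched in a few lines. The cost figures are hypothetical units chosen only for illustration, and `total_cost` is a made-up helper. It also assumes that, without a shared base, every task would repeat the full base-training cost.

```python
def total_cost(base_cost, adapt_cost, n_tasks, share_base):
    """Total training cost, in arbitrary units, for n_tasks specialised models."""
    if share_base:
        # Pay for the base once, then one cheap adaptation per task.
        return base_cost + n_tasks * adapt_cost
    # Assumption: with no shared base, each task repeats the full base cost.
    return n_tasks * (base_cost + adapt_cost)


BASE_COST = 1000  # hypothetical units for pretraining
ADAPT_COST = 10   # hypothetical units for one fine-tuning run
N_TASKS = 5

shared = total_cost(BASE_COST, ADAPT_COST, N_TASKS, share_base=True)
separate = total_cost(BASE_COST, ADAPT_COST, N_TASKS, share_base=False)

print(shared, separate)  # 1050 5050
```

Under these assumed numbers the shared base costs about a fifth as much across five tasks, and the gap widens with every task added.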

When to apply. Use this when you control the model weights and a generic instruction-tuned model is not good enough for your domain or task family. You also need either an existing base checkpoint or enough unlabeled domain text. Skip it when a hosted instruction-tuned API model already clears your quality bar, because then the adaptation cost is not worth it. Exceptions: regulated settings that force the weights to stay on your own servers, or research into how base models themselves behave.

Inputs

  • Unlabeled pretraining corpus: A large body of domain or general text. You use it to train the base model, or keep training an existing one, by having it predict the next word.
  • Base model checkpoint: Either a fresh model to train from scratch, or an existing open-weight base you can keep training.
  • Task-specific labeled data: A smaller, curated dataset for the downstream task. For example, classification labels, instruction-and-response pairs, or domain Q&A.

Outputs

  • Pretrained base model: A general-purpose checkpoint that can predict the next word. You can reuse it across many downstream tasks.
  • Fine-tuned variant(s): One or more task-specialised models, such as a classifier, an instruction follower, or a domain chatbot. Each one branches off the same base.
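One way to picture the relationship between these outputs is a small record structure in which every variant points back to the same immutable base. This is only a sketch. The class names `BaseCheckpoint` and `Variant`, the `branch` helper, and the task names are hypothetical and do not come from any particular training framework.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class BaseCheckpoint:
    """A versioned, read-only record of the pretrained base."""
    name: str
    version: str


@dataclass(frozen=True)
class Variant:
    """A task-specialised model that branches off one base checkpoint."""
    task: str
    head: str  # "classification" or "language-modeling"
    base: BaseCheckpoint


def branch(base, task, head):
    """Create a variant record that points back to the shared base."""
    return Variant(task=task, head=head, base=base)


base = BaseCheckpoint(name="domain-base", version="v1")
variants = [
    branch(base, "ticket-triage", "classification"),
    branch(base, "support-assistant", "language-modeling"),
]

# The base record cannot be changed in place, so every variant keeps a stable parent.
try:
    base.version = "v2"
    base_is_mutable = True
except FrozenInstanceError:
    base_is_mutable = False

print(base_is_mutable, [v.task for v in variants])
```

Keeping the base record immutable means a new task always starts from a known version and never from a checkpoint that an earlier fine-tuning run has changed.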

Steps (5)

  1. Assemble and clean the pretraining corpus

    Gather plain text that reflects both your target domain and general language. Remove duplicates. Filter out low-quality and toxic content. Then split it into tokens using a vocabulary that suits the model.

  2. Pretrain the base with next-token prediction

    Train the model on the text by having it predict the next token (this is the causal language-modeling objective). Track the loss on held-out text, save checkpoints regularly, and stop when that held-out loss flattens out.

  3. Branch to specialization tasks

    Freeze the base checkpoint, or at least version it. For each task, attach the appropriate output head and keep training on that task's data. For label prediction, attach a classification head. For instruction following, keep the language-modeling head.

  4. Validate each variant against a task-specific eval

    Give each specialised version its own test set, with metrics that fit the task, such as accuracy, F1, or an instruction-following score. How well the base predicts text does not tell you how good a variant is at its task.

  5. Re-use the base for the next task

    When a new task arrives, go back to the frozen base instead of training from scratch. Reusing the base is the whole reason this approach saves money.
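Two of the steps above lend themselves to a short sketch: cleaning the corpus (step 1) and deciding when the held-out loss has flattened (step 2). The helpers `clean_corpus` and `has_plateaued`, the thresholds, and the sample documents are illustrative assumptions. Real pipelines typically add near-duplicate detection, learned quality filters, and a proper subword tokenizer.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_corpus(docs, min_words=5, blocklist=frozenset()):
    """Drop very short documents, blocklisted content, and exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        words = re.findall(r"\w+", norm)
        if len(words) < min_words:
            continue  # crude stand-in for a low-quality filter
        if any(word in blocklist for word in words):
            continue  # crude stand-in for a toxicity filter
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate after normalization
        seen.add(digest)
        kept.append(doc.strip())
    return kept


def has_plateaued(losses, patience=3, min_delta=0.01):
    """True when the last `patience` evals beat the earlier best by less than min_delta."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_before - best_recent < min_delta


docs = [
    "The base model learns general language.",
    "the base  model learns general language.",  # duplicate after normalization
    "too short",
    "A classifier branches off the base checkpoint.",
]
cleaned = clean_corpus(docs)

still_improving = has_plateaued([3.0, 2.5, 2.2, 2.0, 1.8, 1.6])
flattened = has_plateaued([3.0, 2.5, 2.2, 2.1, 2.1, 2.1, 2.1])

print(len(cleaned), still_improving, flattened)
```

The plateau check compares the best recent held-out loss with the best earlier one, so a single noisy evaluation does not stop training early.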

Principles

  • Keep the two jobs apart. Learning general knowledge is one job. Shaping behaviour is another. Each can be tuned on its own.
  • The base is a reusable asset, not a throwaway middle step. Version it and keep it.
  • How well the base predicts text is not a task score. Every specialised version needs its own task test.
  • Specialising is only cheap when the base is good. Invest in the base, then iterate fast on the task-specific parts.
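The third principle can be illustrated with the task metrics named in step 4. Below is a minimal sketch of accuracy and F1 for a binary classifier variant. The label lists are made-up example data and the function names are illustrative. In practice a library implementation would usually replace these helpers.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical held-out labels and predictions for one classifier variant.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(acc, f1)  # 0.75 0.75
```

Neither number can be read off the base model's next-token loss, which is why each variant carries its own test set and metric.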

Provenance

  • Added to catalog:
  • Last updated:
  • Verification status: verified