LLM-From-Scratch Build Progression
also known as build-an-LLM seven stages, Raschka build progression
A seven-stage learning path for building a working LLM from scratch on a laptop. Each stage builds on the last and produces something you can run before the next stage starts. Stage 1 builds the text-data pipeline. Stage 2 builds the attention mechanism. Stage 3 builds the full architecture. Stage 4 is pre-training. Stage 5 is supervised fine-tuning for classification. Stage 6 is supervised fine-tuning for instructions. Stage 7 is evaluation with an LLM judge. The goal is not to beat frontier models. It is to remove the black box.
Methodology process overview
Intent. Walk a practitioner through building a working LLM on a laptop in seven stages. Each stage produces something runnable, so the internals stop being a black box.
When to apply. Use this to onboard ML engineers or applied scientists onto LLM projects, to deepen the instincts of people who have only ever called APIs, or for any team that needs a real grasp of attention, tokenisation, pre-training versus fine-tuning, and instruction tuning. Do not apply it when the goal is to ship product on a deadline. This is a learning path, not a delivery one. Skip it when the team already has deep in-house expertise and the time is better spent on the application itself.
Inputs
- Laptop or modest GPU box — Hardware that can run small-scale training, such as a modern laptop with 16GB or more of RAM, and optionally a single consumer GPU.
- Public text corpus — A small public dataset, such as the works of Shakespeare, an OpenWebText snippet, or public-domain books. It only needs to be enough for teaching-scale pre-training.
- Existing pretrained weights for late stages — GPT-2 small or similar open weights to kick-start the fine-tuning stages without weeks of pre-training.
Outputs
- Working small LLM — A from-scratch GPT-style model that generates text. It has been fine-tuned for both classification and instruction-following.
- Stage-by-stage runnable code — Seven runnable artefacts. Each one shows a single capability on its own.
- Internalised mental model — A practitioner-level grasp of tokenisation, attention, pre-training loss, fine-tuning data, and LLM-judge evaluation.
Steps (7)
Stage 1: text data and tokenisation
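Build the data pipeline. Implement byte-pair or word-level tokenisation. Produce input-target pairs in which each target is the input shifted by one token. Check the tokeniser by decoding token IDs back into text and confirming the round trip reproduces the original. Skip this stage and every later debugging session turns into data archaeology.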
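A minimal sketch of this stage, assuming a word-level tokeniser built on a regular expression. The function names (`tokenize`, `build_vocab`, `encode`, `decode`, `make_pairs`) are illustrative and not taken from any particular library. Whitespace runs are kept as tokens so the decode step can reproduce the input exactly.

```python
import re

def tokenize(text):
    # Split into words, single punctuation marks, and whitespace runs.
    # Every character falls into one of the three classes, so joining
    # the tokens gives back the original text.
    return re.findall(r"\w+|[^\w\s]|\s+", text)

def build_vocab(text):
    # Map each distinct token to an integer ID (sorted for determinism).
    return {tok: i for i, tok in enumerate(sorted(set(tokenize(text))))}

def encode(text, vocab):
    return [vocab[tok] for tok in tokenize(text)]

def decode(ids, vocab):
    inverse = {i: tok for tok, i in vocab.items()}
    return "".join(inverse[i] for i in ids)

def make_pairs(ids, context_len):
    # Slide a window over the IDs; the target is the input shifted by one.
    pairs = []
    for start in range(len(ids) - context_len):
        x = ids[start:start + context_len]
        y = ids[start + 1:start + context_len + 1]
        pairs.append((x, y))
    return pairs

text = "To be, or not to be."
vocab = build_vocab(text)
ids = encode(text, vocab)

# Round-trip check: decoding must reproduce the input exactly.
assert decode(ids, vocab) == text

pairs = make_pairs(ids, context_len=4)
```

A byte-pair tokeniser would replace `tokenize` and `build_vocab`, but the round-trip check and the shifted-window pairing stay the same.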
Stage 2: attention
Implement scaled dot-product attention from scratch, then extend it to multi-head attention. Verify both on tiny matrices whose results you can work out by hand. Visualise the attention weights so the mechanism stops feeling abstract.
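As a rough illustration, scaled dot-product attention can be written with nothing but nested lists, which keeps every intermediate value inspectable. The helper names and the optional `causal` flag are assumptions made for this sketch. A real build would normally use a tensor library.

```python
import math

def softmax(row):
    # Subtract the row maximum first for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def attention(q, k, v, causal=False):
    # scores[i][j] = (q_i . k_j) / sqrt(d_k)
    d_k = len(k[0])
    scores = [[s / math.sqrt(d_k) for s in row]
              for row in matmul(q, transpose(k))]
    if causal:
        # Position i may only attend to positions j <= i.
        scores = [[s if j <= i else float("-inf") for j, s in enumerate(row)]
                  for i, row in enumerate(scores)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

# Three tokens, two dimensions: small enough to verify by hand.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = attention(q, k, v, causal=True)
for row in weights:
    print([round(w, 3) for w in row])
```

With the causal mask on, the first row of `weights` puts all its mass on the first token, because that position has nothing else to attend to. Multi-head attention runs this same function on several lower-dimensional projections of the input and concatenates the results.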
Stage 3: full transformer architecture
Put together the embeddings, attention blocks, feed-forward layers, residual connections, and layer normalisation. Run a forward pass on a fixed input and confirm the shapes match what you expect.
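One way to make "the shapes you expect" concrete is to write them down before running the model. The sketch below assumes a GPT-style decoder. The `Config` fields and the dictionary keys are hypothetical names chosen for illustration, not a fixed convention.

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Hypothetical field names for a small GPT-style model.
    vocab_size: int
    context_len: int
    d_model: int
    n_heads: int
    n_layers: int

def expected_shapes(cfg, batch, seq_len):
    # Return the tensor shape expected at each point of the forward pass.
    if seq_len > cfg.context_len:
        raise ValueError("sequence is longer than the context window")
    if cfg.d_model % cfg.n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    head_dim = cfg.d_model // cfg.n_heads
    return {
        "token_ids": (batch, seq_len),
        "embeddings": (batch, seq_len, cfg.d_model),
        "per_head_qkv": (batch, cfg.n_heads, seq_len, head_dim),
        "attention_weights": (batch, cfg.n_heads, seq_len, seq_len),
        "block_output": (batch, seq_len, cfg.d_model),
        "logits": (batch, seq_len, cfg.vocab_size),
    }

cfg = Config(vocab_size=100, context_len=16, d_model=32, n_heads=4, n_layers=2)
for name, shape in expected_shapes(cfg, batch=2, seq_len=8).items():
    print(f"{name:18s} {shape}")
```

Comparing these tuples with the actual shapes at each layer tends to catch transposed dimensions and head-splitting mistakes early. Residual connections require each block's output to have the same shape as its input, which is why `block_output` matches `embeddings`.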
Stage 4: pre-training
Train next-token prediction on a small corpus. Watch the loss curve and generate sample text at intervals during training. At this scale the samples will be poor, and that is expected. Comparing them with output from a larger pretrained model shows what scale actually buys you.
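The quantity on the loss curve is the average cross-entropy between the model's predicted distribution and the actual next token. A minimal sketch in plain Python, with illustrative function names, also gives a useful sanity check: a model that predicts uniformly over the vocabulary should score a loss of ln(vocab_size), so an untrained model's first loss values should land near that number.

```python
import math

def cross_entropy(logits, target):
    # Negative log-probability of the target token, computed with the
    # log-sum-exp trick for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

def mean_loss(batch_logits, targets):
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

vocab_size = 50

# Uniform logits: the model has no preference for any token.
uniform = [[0.0] * vocab_size for _ in range(3)]
loss = mean_loss(uniform, [1, 7, 42])

print("loss:", loss)
print("ln(vocab_size):", math.log(vocab_size))
print("perplexity:", math.exp(loss))
```

Perplexity is the exponential of the loss, so the uniform model's perplexity comes out at the vocabulary size. As training progresses, both numbers should fall.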
Stage 5: SFT for classification
Fine-tune the pre-trained model for a classification task such as sentiment or topic. Confirm the model adapts by checking that accuracy on held-out examples rises above the chance baseline. This is a very small fine-tuning loop and the gentlest way into the supervised fine-tuning machinery.
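A sketch of the classification step, under one loudly stated assumption: the hidden state of the last token is used as the summary of the whole sequence, which is a plausible choice for a causal model because the last position has attended to every earlier token. The source does not specify a pooling strategy, and the function names and toy weights here are hypothetical.

```python
def linear(x, weight, bias):
    # One output logit per row of the weight matrix.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def classify(hidden_states, weight, bias):
    # Assumption: pool by taking the final token's hidden state.
    last = hidden_states[-1]
    logits = linear(last, weight, bias)
    predicted = max(range(len(logits)), key=logits.__getitem__)
    return predicted, logits

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Toy hidden states for a two-token sequence with d_model = 2,
# and a two-class head with hand-picked weights.
hidden = [[0.0, 0.0], [1.0, -1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]

pred, logits = classify(hidden, weight, bias)
print(pred, logits)
print(accuracy([pred, 1, 0], [0, 1, 1]))
```

In an actual fine-tuning loop the head's weights are learned, and the `accuracy` helper is what gets compared against the chance baseline on held-out data.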
Stage 6: SFT for instructions
Fine-tune on instruction-and-response pairs. The model goes from completing text to following instructions. Compare it side by side with the pre-trained model to feel the shift.
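The sketch below shows one way the instruction-and-response pairs might be prepared. Two assumptions are not in the text above. The first is the prompt template, with its `### Instruction:` and `### Response:` headers. The second is the choice to mask prompt tokens out of the loss, so the model is only trained on the response. The value -100 matches the default `ignore_index` of PyTorch's cross-entropy loss, but any sentinel the training loop skips would do.

```python
IGNORE = -100  # assumption: label value the loss function skips

def format_prompt(instruction, input_text=""):
    # Hypothetical template; the optional input section is omitted if empty.
    prompt = f"### Instruction:\n{instruction}\n"
    if input_text:
        prompt += f"\n### Input:\n{input_text}\n"
    return prompt + "\n### Response:\n"

def build_example(prompt_ids, response_ids):
    # Concatenate prompt and response, mask the prompt in the labels,
    # then shift by one so each position predicts the next token.
    input_ids = prompt_ids + response_ids
    labels = [IGNORE] * len(prompt_ids) + response_ids
    return input_ids[:-1], labels[1:]

print(format_prompt("Translate to French.", "Good morning"))

# Token IDs are made up for illustration.
inputs, targets = build_example(prompt_ids=[1, 2, 3], response_ids=[4, 5])
print(inputs)
print(targets)
```

Feeding the same prompt to the pre-trained and the instruction-tuned model, with identical formatting, is what makes the side-by-side comparison fair.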
Stage 7: LLM-judge evaluation
Score the instruction-tuned model's responses with an LLM judge against a rubric. This closes the loop on what 'better' means when there is no crisp reference answer.
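A minimal sketch of the judging loop, assuming the judge is any callable that takes a prompt string and returns text. The rubric criteria, the 1-to-5 scale, the `Score: <n>` reply format, and `stub_judge` are all hypothetical placeholders. A real run would swap the stub for a call to an actual model.

```python
import re

# Hypothetical rubric; a real one would be written for the task at hand.
RUBRIC = [
    "Follows the instruction",
    "Is factually correct",
    "Is clear and concise",
]

def build_judge_prompt(instruction, response, rubric):
    criteria = "\n".join(f"- {c}" for c in rubric)
    return (
        "Score the response to the instruction on a scale from 1 to 5.\n"
        f"Criteria:\n{criteria}\n\n"
        f"Instruction: {instruction}\n"
        f"Response: {response}\n"
        "Reply with 'Score: <n>' and nothing else."
    )

def parse_score(reply, low=1, high=5):
    # Return None for replies that are malformed or out of range,
    # rather than guessing what the judge meant.
    match = re.search(r"Score:\s*(\d+)", reply)
    if not match:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None

def evaluate(examples, judge, rubric):
    scores = []
    for instruction, response in examples:
        reply = judge(build_judge_prompt(instruction, response, rubric))
        score = parse_score(reply)
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else None

def stub_judge(prompt):
    # Stand-in for a real LLM call so the sketch runs offline.
    return "Score: 4"

examples = [("Name a primary colour.", "Red."), ("Add 2 and 2.", "4")]
print(evaluate(examples, stub_judge, RUBRIC))
```

Keeping the rubric and the reply format fixed across runs is what makes scores from different checkpoints comparable.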
Principles
- Each stage produces something runnable before the next stage starts. The learning follows the build, not the textbook.
- Small scale, full coverage. Every idea used in frontier models gets exercised at laptop scale.
- Compare neighbouring stages. You feel the gap between pre-trained and instruction-tuned only by running both.
- Evaluation is the final stage, not an afterthought. The learning mirrors production.
Known failure modes (2)
Related patterns (3)
- LLM-as-Judge★★
Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
- Frozen Rubric Reflection★
Constrain reflection to a fixed, hand-authored rubric of criteria so the reviewer cannot invent new ones each run.
- Structured Output★★
Constrain the model's output to conform to a JSON Schema (or similar typed shape).
Related methodologies (2)
- Scale-Down-to-Understand Pedagogy★
Build a laptop-scale version of the same architecture before you consume the frontier version, so the team reasons about the system instead of treating it as a black box.
- Evaluation-Driven Development★★
Judge every prompt change, model swap, search tweak, and new tool against a test you committed to up front, not by feel.
Sources (2)
Provenance
- Added to catalog:
- Last updated:
- Verification status: verified