Methodology · LLM-App Engineering · emerging · verified

LLM-From-Scratch Build Progression

also known as build-an-LLM seven stages, Raschka build progression

Applies to: llm-app

Tags: from-scratch, pedagogy, transformer, sft

A seven-stage learning path for building a working LLM from scratch on a laptop. Each stage builds on the last and produces something you can run before the next stage starts. Stage 1 builds the text-data pipeline. Stage 2 builds the attention mechanism. Stage 3 builds the full architecture. Stage 4 is pre-training. Stage 5 is supervised fine-tuning for classification. Stage 6 is supervised fine-tuning for instructions. Stage 7 is evaluation with an LLM judge. The goal is not to beat frontier models. It is to remove the black box.

Methodology process overview

flowchart LR
  in1[Laptop or modest GPU] --> s1[Stage 1: text data and tokenisation]
  in2[Public text corpus] --> s1
  s1 --> s2[Stage 2: attention]
  s2 --> s3[Stage 3: full transformer architecture]
  s3 --> s4[Stage 4: pre-training]
  in3[Existing pretrained weights] --> s5
  s4 --> s5[Stage 5: SFT classification]
  s5 --> s6[Stage 6: SFT instructions]
  s6 --> s7[Stage 7: LLM-judge evaluation]
  s7 --> out1[Working small LLM]
  s1 --> out2[Stage-by-stage runnable code]
  s2 --> out2
  s3 --> out2
  s4 --> out2
  s5 --> out2
  s6 --> out2
  s7 --> out2
  s7 --> out3[Internalised mental model]

Intent. Walk a practitioner through building a working LLM on a laptop in seven stages. Each stage produces something runnable, so the internals stop being a black box.

When to apply. Use this to onboard ML engineers or applied scientists onto LLM projects, to deepen the instincts of people who have only ever called APIs, or for any team that needs a real grasp of attention, tokenisation, pre-training versus fine-tuning, and instruction tuning. Do not apply it when the goal is to ship product on a deadline. This is a learning path, not a delivery one. Skip it when the team already has deep in-house expertise and the time is better spent on the application itself.

Example scenario

A research engineering team at a financial-services firm had four ML engineers who had only ever used the OpenAI API. They were about to be asked to own a fine-tuning project. The head of research ran the Raschka seven-stage progression as a four-week onboarding bootcamp before any production work began. Each engineer worked through every stage on their own laptop, an M2 MacBook Pro with 32GB RAM. They used the rasbt/LLMs-from-scratch repo as scaffolding but rewrote the key components from blank notebooks.

Stage 1 took everyone longer than expected. Byte-pair encoding was the first place naive intuitions broke. The stage 2 attention visualisations on Shakespeare became the team's reference point for talking about attention afterward. Stage 4 pre-training on a 5MB Shakespeare slice taught the team, in their gut, why frontier-model pretraining costs what it does. Their tiny model could complete 'To be or' but not much else after two hours of training. Running stages 5 and 6 side by side made the pre-trained-versus-instruction-tuned distinction stop being a slogan. Stage 7's LLM-judge eval became the template for the production rubric they later built.

The lesson the team kept: when they later debugged a misbehaving fine-tune, they reasoned from first principles about loss curves and tokenisation instead of guessing. The four weeks paid back in the first incident.

Inputs

Laptop or modest GPU box — Hardware that can run small-scale training, such as a modern laptop with 16GB or more of RAM, and optionally a single consumer GPU.
Public text corpus — A small public dataset, such as the works of Shakespeare, an OpenWebText snippet, or public-domain books. It only needs to be enough for teaching-scale pre-training.
Existing pretrained weights for late stages — GPT-2 small or similar open weights to kick-start the fine-tuning stages without weeks of pre-training.

Outputs

Working small LLM — A from-scratch GPT-style model that generates text. It has been fine-tuned for both classification and instruction-following.
Stage-by-stage runnable code — Seven runnable artefacts. Each one shows a single capability on its own.
Internalised mental model — A practitioner-level grasp of tokenisation, attention, pre-training loss, fine-tuning data, and LLM-judge evaluation.

Steps (7)

Stage 1: text data and tokenisation
Build the data pipeline. Write byte-pair or word-level tokenisation. Produce input-target pairs. Check it by turning tokens back into text. Skip this stage and every later debugging session turns into data archaeology.
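The round-trip check and the input-target pairs can be sketched as below. This is a minimal illustration that assumes a toy word-level vocabulary rather than byte-pair encoding; the names `tokenize`, `build_vocab`, `encode`, `decode`, `make_pairs`, and the `context_len` parameter are hypothetical and not taken from any particular codebase.

```python
import re

def tokenize(text):
    # Split into words and punctuation marks; whitespace is discarded.
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(text):
    # Map each distinct token to an integer id, in sorted order.
    return {tok: i for i, tok in enumerate(sorted(set(tokenize(text))))}

def encode(text, vocab):
    return [vocab[tok] for tok in tokenize(text)]

def decode(ids, vocab):
    inverse = {i: tok for tok, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)

def make_pairs(ids, context_len):
    # Each target is the input window shifted one token to the right.
    return [
        (ids[i : i + context_len], ids[i + 1 : i + context_len + 1])
        for i in range(len(ids) - context_len)
    ]

text = "to be or not to be"
vocab = build_vocab(text)
ids = encode(text, vocab)
pairs = make_pairs(ids, context_len=3)

# Round trip: decoding the encoded ids should reproduce the text.
assert decode(ids, vocab) == text
```

The round trip is exact here only because the sample has no punctuation; a word-level decoder that joins on spaces would put a space before every comma, which is one of the reasons byte-pair encoding is usually preferred.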
Stage 2: attention
Write scaled dot-product attention from scratch, then multi-head attention. Check it on tiny matrices. Visualise the attention weights so the mechanism stops feeling abstract.
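A tiny-matrix check of this kind might look like the sketch below. It assumes a single head with a causal mask and uses the input directly as queries, keys, and values, with no learned projections, so it shows only the core computation. The function names are illustrative.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Scaled dot-product attention where position i sees positions 0..i."""
    d_k = len(k[0])
    weights = []
    for i, q_i in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k)
            if j <= i else float("-inf")  # mask out future positions
            for j, k_j in enumerate(k)
        ]
        weights.append(softmax(scores))
    d_v = len(v[0])
    # Each output row is the attention-weighted average of the value rows.
    out = [
        [sum(w * v_j[c] for w, v_j in zip(w_i, v)) for c in range(d_v)]
        for w_i in weights
    ]
    return out, weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_attention(x, x, x)

for row in weights:
    print([round(w, 3) for w in row])
```

Two properties are worth checking by eye: every row of `weights` sums to 1, and every entry above the diagonal is 0, because a position may not attend to tokens that come after it.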
Stage 3: full transformer architecture
Put together the embeddings, attention blocks, feed-forward layers, residual connections, and layer normalisation. Run a forward pass on a fixed input and confirm the shapes match what you expect.
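The shape check can be sketched with NumPy as below. This is an assumption-heavy miniature: one transformer block, random untrained weights, layer normalisation without learned scale and shift, and ReLU in place of GELU. All sizes and parameter names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, context_len, d_model, n_heads = 50, 8, 16, 4

def layer_norm(x, eps=1e-5):
    # Normalise each token vector to zero mean and unit variance.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def split_heads(t):
    # (batch, seq, d_model) -> (batch, heads, seq, head_dim)
    batch, seq, d = t.shape
    return t.reshape(batch, seq, n_heads, d // n_heads).transpose(0, 2, 1, 3)

def causal_self_attention(x, w_qkv, w_out):
    batch, seq, d = x.shape
    q, k, v = (split_heads(t) for t in np.split(x @ w_qkv, 3, axis=-1))
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d // n_heads)
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # hide later positions
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    context = (weights @ v).transpose(0, 2, 1, 3).reshape(batch, seq, d)
    return context @ w_out

def feed_forward(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2  # ReLU for brevity

def init(*shape):
    return rng.normal(0.0, 0.02, shape)

params = {
    "tok_emb": init(vocab_size, d_model),
    "pos_emb": init(context_len, d_model),
    "w_qkv": init(d_model, 3 * d_model),
    "w_out": init(d_model, d_model),
    "w1": init(d_model, 4 * d_model),
    "w2": init(4 * d_model, d_model),
    "w_head": init(d_model, vocab_size),
}

def forward(token_ids, p):
    seq = token_ids.shape[1]
    x = p["tok_emb"][token_ids] + p["pos_emb"][:seq]
    x = x + causal_self_attention(layer_norm(x), p["w_qkv"], p["w_out"])  # residual
    x = x + feed_forward(layer_norm(x), p["w1"], p["w2"])                 # residual
    return layer_norm(x) @ p["w_head"]  # one logit per vocabulary entry

token_ids = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
logits = forward(token_ids, params)
assert logits.shape == (2, 5, vocab_size)  # (batch, sequence, vocabulary)
```

The expected shape is one row of vocabulary-sized logits per input position. A useful second check is that changing the last input token leaves the logits at earlier positions unchanged, which confirms the causal mask is doing its job.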
Stage 4: pre-training
Train next-token prediction on a small corpus. Watch the loss curve. Generate sample text along the way. The output is bad on purpose. This is the moment to feel what scale actually buys you.
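When watching the loss curve it helps to know where it should start. The sketch below computes the next-token cross-entropy for a model that spreads probability evenly over the vocabulary, which equals the natural log of the vocabulary size. The function name and the target id are illustrative; 50257 is the GPT-2 vocabulary size.

```python
import math

def cross_entropy(logits, target_id):
    # Negative log-probability of the correct next token under softmax(logits).
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_id]

vocab_size = 50257                   # GPT-2 vocabulary size
uniform_logits = [0.0] * vocab_size  # a model with no preference at all
baseline = cross_entropy(uniform_logits, target_id=123)

print(f"loss at uniform guessing: {baseline:.2f}")
print(f"perplexity: {math.exp(baseline):.0f}")
```

With the GPT-2 vocabulary this baseline is about 10.8. A freshly initialised model whose first logged loss sits far above that value is more likely to have a bug in the data or loss code than a learning problem, and every drop below it is something the model has actually learned.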
Stage 5: SFT for classification
Fine-tune the pre-trained model for a classification task such as sentiment or topic. Confirm the model adapts. This is the smallest fine-tuning loop there is and the gentlest way into the supervised fine-tuning machinery.
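One common way to set this up is to replace the vocabulary-sized output layer with a small classification head and read the prediction from the last token's hidden state. The sketch below is a simplification under stated assumptions: the hidden states are random stand-ins for a frozen pre-trained model's output, the labels are synthetic, and only the new head is trained. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, seq_len, d_model, n_classes = 64, 6, 8, 2

# Stand-in for the frozen pre-trained model's hidden states (synthetic data).
hidden = rng.normal(size=(n_examples, seq_len, d_model))
# Toy labels that depend on the last token's features, so the task is learnable.
labels = (hidden[:, -1, 0] > 0).astype(int)

# With causal attention only the last position has seen the whole sequence,
# so its hidden state is the one fed to the classifier.
features = hidden[:, -1, :]

w = np.zeros((d_model, n_classes))  # new classification head
b = np.zeros(n_classes)
rows = np.arange(n_examples)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

losses = []
for step in range(200):
    probs = softmax(features @ w + b)
    losses.append(-np.log(probs[rows, labels]).mean())
    grad = probs.copy()
    grad[rows, labels] -= 1.0  # gradient of the loss w.r.t. the logits
    grad /= n_examples
    w -= 0.5 * features.T @ grad
    b -= 0.5 * grad.sum(axis=0)

accuracy = (softmax(features @ w + b).argmax(axis=1) == labels).mean()
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, accuracy {accuracy:.2f}")
```

The first loss equals ln 2, the two-class equivalent of guessing, and it should fall from there. In a real run the features would come from the pre-trained model, and some of its later layers are often unfrozen as well.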
Stage 6: SFT for instructions
Fine-tune on instruction-and-response pairs. The model goes from completing text to following instructions. Compare it side by side with the pre-trained model to feel the shift.
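Most of the work in this stage is turning each instruction-and-response pair into a single training string. The sketch below assumes an Alpaca-style template; the exact wording, the section markers, and the field names `instruction`, `input`, and `output` are conventions chosen for illustration, not a fixed standard.

```python
def format_example(entry, include_response=True):
    # Alpaca-style layout; the wording here is an assumption, not a standard.
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    if entry.get("input"):
        # The input section is optional and omitted when empty.
        prompt += f"\n\n### Input:\n{entry['input']}"
    prompt += "\n\n### Response:\n"
    if include_response:
        prompt += entry["output"]
    return prompt

example = {
    "instruction": "Rewrite the sentence in the passive voice.",
    "input": "The chef cooked the meal.",
    "output": "The meal was cooked by the chef.",
}

training_text = format_example(example)  # fine-tuning sees prompt plus response
inference_prompt = format_example(example, include_response=False)  # generation sees prompt only

print(training_text)
```

Using the same function for training and inference keeps the two formats identical up to the response marker. A mismatch there is an easy way to get a fine-tuned model that appears not to follow instructions at all.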
Stage 7: LLM-judge evaluation
Score the instruction-tuned model with an LLM judge against a rubric. This closes the loop on what 'better' means when there is no crisp reference answer.
Uses: LLM-as-Judge Frozen Rubric Reflection
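The scoring loop can be sketched as below, assuming a 0 to 100 scale and a rubric string that stays fixed across runs so scores remain comparable. The judge is passed in as a plain callable; `stub_judge` is a hypothetical offline stand-in, and the rubric wording and all function names are illustrative rather than any library's API.

```python
import re

RUBRIC = (
    "Score the model response from 0 to 100 for correctness and helpfulness, "
    "where 100 is the best score. Reply with the integer only."
)

def build_judge_prompt(instruction, reference, response):
    return (
        f"{RUBRIC}\n\nInstruction: {instruction}\n"
        f"Reference answer: {reference}\nModel response: {response}\nScore:"
    )

def parse_score(judge_reply):
    # Take the first integer in the reply; None if the judge went off-script.
    match = re.search(r"\d+", judge_reply)
    if match is None:
        return None
    score = int(match.group())
    return score if 0 <= score <= 100 else None

def evaluate(examples, judge):
    # `judge` is any callable mapping a prompt string to the judge's reply.
    scores = [parse_score(judge(build_judge_prompt(**ex))) for ex in examples]
    valid = [s for s in scores if s is not None]
    mean = sum(valid) / len(valid) if valid else None
    return mean, len(scores) - len(valid)

def stub_judge(prompt):
    # Hypothetical stand-in so the sketch runs offline; swap in a real model call.
    response = prompt.split("Model response:")[1]
    return "Score: 80" if "Paris" in response else "I cannot say."

examples = [
    {"instruction": "Capital of France?", "reference": "Paris", "response": "Paris."},
    {"instruction": "Capital of France?", "reference": "Paris", "response": "Lyon."},
]
mean_score, unparsed = evaluate(examples, stub_judge)
print(mean_score, unparsed)
```

Counting unparsed replies separately, instead of silently dropping them or scoring them as zero, keeps judge failures visible and stops them from distorting the mean.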

Principles

Each stage produces something runnable before the next stage starts. The learning follows the build, not the textbook.
Small scale, full coverage. Every idea used in frontier models gets exercised at laptop scale.
Compare neighbouring stages. You feel the gap between pre-trained and instruction-tuned only by running both.
Evaluation is the final stage, not an afterthought. The learning mirrors production.
