Finetune-as-Last-Resort Escalation
Also known as: fine-tuning escalation ladder, exhaust prompting first.
Treat fine-tuning as the last step on a ladder, not the first lever you grab. The rule of thumb: fine-tuning is for form, retrieval is for facts. Climb the ladder in order: prompt engineering, then few-shot prompting, then retrieval-augmented generation (RAG), then advanced RAG, then task decomposition or agent orchestration. Reach for fine-tuning only once those steps run out. Every earlier step is cheap to undo; fine-tuning is not.
Methodology process overview
Intent. Make teams exhaust prompt engineering, retrieval, and task splitting before they fine-tune, because fine-tuning is the most expensive step and the hardest to undo.
When to apply. Use this whenever a team is thinking about fine-tuning a model for a specific behaviour or quality. Apply it as a gate: have we tried the cheaper steps first? Don't apply it when fine-tuning is clearly the right tool, such as teaching a fixed response format, adding a new input type, or shrinking a large model into a small one. Also skip the ladder when the data clearly shows prompt engineering has stalled and the gap that remains is about form.
Inputs
- Current quality measurement — The score on your test set with the current setup, meaning the current model, prompt, and retrieval.
- Quality target — The score the system must reach to ship or to meet a promise made to stakeholders.
- Gap diagnosis — A breakdown of how the current system fails, sorted into facts, form, reasoning, retrieval, and tool use. This is how you pick the right step.
Outputs
- Escalation log — An ordered record of which steps you tried, the score each one produced, and why you needed the next step.
- Selected lever — The lowest step that closes the gap to the quality target.
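The two outputs can be held in one small structure. The sketch below is illustrative only: the class names, the `LADDER` labels, and the scores in the usage lines are assumptions made for the example, not part of the methodology.

```python
from dataclasses import dataclass, field

# Ladder order from the text, cheapest and most reversible first.
LADDER = [
    "prompt engineering",
    "few-shot prompting",
    "RAG",
    "advanced RAG",
    "task decomposition / agent",
    "fine-tuning",
]


@dataclass
class LogEntry:
    step: str
    score: float
    reason_for_next: str  # why this step was not enough on its own


@dataclass
class EscalationLog:
    target: float
    entries: list = field(default_factory=list)

    def record(self, step, score, reason_for_next=""):
        if step not in LADDER:
            raise ValueError(f"unknown step: {step}")
        self.entries.append(LogEntry(step, score, reason_for_next))

    def selected_lever(self):
        """Return the lowest ladder step whose score meets the target, or None."""
        passing = [e for e in self.entries if e.score >= self.target]
        if not passing:
            return None
        return min(passing, key=lambda e: LADDER.index(e.step)).step


# Hypothetical scores, for illustration only.
log = EscalationLog(target=0.90)
log.record("prompt engineering", 0.78, "format errors remain")
log.record("few-shot prompting", 0.91)
print(log.selected_lever())  # few-shot prompting
```

Keeping the reason next to each score means the log can later show why each escalation happened, not just that it did.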
Steps (6)
Diagnose the gap before climbing
Run the test set and look at how it fails. Is the model wrong about facts, which is a retrieval problem? Wrong in form, which is a fine-tuning problem? Wrong in reasoning, which is a task-splitting problem? Or wrong because a tool failed? The diagnosis picks the right step. Without it, you climb in the wrong direction.
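One way to make the diagnosis concrete is to label each failed test case with a category and count them. This is a minimal sketch: it assumes the failures have already been labelled by hand, and the hints for "retrieval" and "tool use" are assumptions, since the text names those categories without assigning them a step.

```python
from collections import Counter

# Category -> the step the text points to. The last two hints are assumed.
STEP_HINT = {
    "facts": "retrieval",
    "form": "fine-tuning, after prompting and few-shot have been measured",
    "reasoning": "task splitting",
    "retrieval": "advanced retrieval (assumed mapping)",
    "tool use": "fix the tool call (assumed mapping)",
}


def diagnose(failures):
    """failures: iterable of category labels, one per failed test case.

    Returns (category, count, share, hint) tuples, most frequent first.
    """
    counts = Counter(failures)
    unknown = set(counts) - set(STEP_HINT)
    if unknown:
        raise ValueError(f"unlabelled categories: {sorted(unknown)}")
    total = sum(counts.values())
    return [(cat, n, n / total, STEP_HINT[cat]) for cat, n in counts.most_common()]


# Hypothetical labels for five failed cases.
for cat, n, share, hint in diagnose(["facts", "form", "facts", "reasoning", "facts"]):
    print(f"{cat:10s} {n}  {share:.0%}  -> {hint}")
```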
Step 1: prompt engineering
Improve the system prompt, the output format, the role you give the model, and the structured output. Measure on the test set. Many gaps close right here. It is cheap and easy to undo.
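"Measure on the test set" can be as small as one scoring function applied to each prompt variant. In the sketch below, `stub_model` is a stand-in for a real model call and the two-case test set is invented. The pattern it shows is the same test set and the same check across two prompts.

```python
import json


def score(generate, test_set, passes):
    """Fraction of test cases for which passes(output, case) is true."""
    if not test_set:
        raise ValueError("empty test set")
    return sum(bool(passes(generate(c["input"]), c)) for c in test_set) / len(test_set)


def valid_json_label(output, case):
    """Pass only if the output is a JSON object with the expected label."""
    try:
        return json.loads(output).get("label") == case["expected"]
    except (json.JSONDecodeError, AttributeError):
        return False


def stub_model(system_prompt, text):
    # Stand-in for a real model: it emits JSON only when the prompt asks for it.
    label = "positive" if "good" in text else "negative"
    return json.dumps({"label": label}) if "JSON" in system_prompt else label


test_set = [
    {"input": "good film", "expected": "positive"},
    {"input": "dull plot", "expected": "negative"},
]

baseline = score(lambda t: stub_model("Classify the review.", t), test_set, valid_json_label)
improved = score(
    lambda t: stub_model('Classify the review. Reply as JSON: {"label": "..."}', t),
    test_set,
    valid_json_label,
)
print(baseline, improved)  # 0.0 1.0
```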
Step 2: few-shot prompting
Add a few hand-picked examples in the prompt. They show the format and the reasoning you want. Measure again. This works well for form and style problems.
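A few-shot prompt is just the instruction, the chosen examples, and the new input laid out in one consistent shape. The `Input:` / `Output:` layout and the function name below are assumptions for illustration; any consistent layout serves the same purpose.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the new query into one prompt."""
    blocks = [instruction.strip()]
    for ex in examples:
        blocks.append(f"Input: {ex['input']}\nOutput: {ex['output']}")
    # Leave the final Output empty so the model completes it in the same format.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


# Hypothetical examples showing the target format.
examples = [
    {"input": "Invoice 114 is overdue", "output": '{"type": "billing", "urgent": true}'},
    {"input": "How do I reset my password?", "output": '{"type": "account", "urgent": false}'},
]
print(build_few_shot_prompt("Classify the support message.", examples, "My card was charged twice"))
```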
Step 3: retrieval-augmented generation
If facts are the problem, add retrieval. Start with the simplest setup and measure. Move to advanced retrieval, such as reranking, query rewriting, and layered search, only if the simple version stalls.
Uses: Naive RAG, Agentic RAG, Contextual Retrieval, Cross-Encoder Reranking
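For Step 3, the simplest retrieval setup can be sketched without any external service. The example below ranks documents by word overlap as a stand-in for an embedding index (an assumption made to keep it self-contained), then places the top results in the prompt. The documents and function names are hypothetical.

```python
import re


def tokens(text):
    """Lowercased alphanumeric words, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, docs, k=2):
    """Return up to k docs sharing at least one word with the query, best overlap first."""
    q = tokens(query)
    ranked = sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)
    return [d for d in ranked[:k] if q & tokens(d)]


def grounded_prompt(query, docs, k=2):
    """Build a prompt that asks the model to answer from the retrieved context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Answer using only the context.\n\nContext:\n{context}\n\nQuestion: {query}"


docs = [
    "Refunds are issued within 14 days.",
    "Shipping takes 3 days.",
    "Our office is in Leeds.",
]
print(grounded_prompt("How long do refunds take?", docs, k=1))
```

If the score stalls with a setup like this, that is the signal to try reranking or query rewriting, rather than starting with them.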
Step 4: task decomposition or agent orchestration
If the task is too big for one call, split it. Use plan-and-execute, prompt chaining, or query decomposition. Each piece then becomes a smaller, easier problem.
Uses: Plan-and-Execute, Prompt Chaining, Query-Decomposition Agent
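Of the options in Step 4, prompt chaining is the easiest to sketch: a fixed sequence of calls where each output feeds the next. In the example below, `call_model` is a placeholder for whatever client is in use, and the templates are hypothetical. Templates containing literal braces would need escaping for `str.format`.

```python
def run_chain(templates, initial, call_model):
    """Run a fixed sequence of prompts; each output becomes the next {input}.

    Returns the final output and the trace of intermediate outputs.
    """
    trace = []
    value = initial
    for template in templates:
        value = call_model(template.format(input=value))
        trace.append(value)
    return value, trace


# Hypothetical three-step chain.
templates = [
    "Extract the key claims from this text:\n{input}",
    "Check each claim for internal consistency:\n{input}",
    "Write a two-sentence summary of the findings:\n{input}",
]


def echo_model(prompt):
    # Stand-in for a real model call: returns the first line of the prompt.
    return prompt.splitlines()[0]


final, trace = run_chain(templates, "some long document", echo_model)
print(len(trace), final)
```

Keeping the trace makes it possible to score each step separately, so a failure can be assigned to one link of the chain.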
Step 5: fine-tune
Only now. Fine-tune for form, a fixed response shape, or to shrink a big model into a small one. Retrieval is still for facts, so do not try to fine-tune knowledge in. Record the scores from every earlier step. That way the data, not excitement, justifies the cost of fine-tuning.
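The gate in Step 5 can be written as a check over the recorded scores. This sketch assumes every cheaper step must have a recorded score before fine-tuning is approved, which is one strict reading of "record the scores from every earlier step". The step labels, function name, and numbers are hypothetical.

```python
CHEAPER_STEPS = [
    "prompt engineering",
    "few-shot prompting",
    "RAG",
    "task decomposition",
]


def fine_tuning_justified(scores, target, dominant_gap):
    """Return (decision, reason) from recorded step scores and the diagnosed gap."""
    missing = [s for s in CHEAPER_STEPS if s not in scores]
    if missing:
        return False, "not yet measured: " + ", ".join(missing)
    if max(scores[s] for s in CHEAPER_STEPS) >= target:
        return False, "a cheaper step already meets the target"
    if dominant_gap != "form":
        return False, "remaining gap is not about form"
    return True, "cheaper steps measured and short of target; gap is form"


# Hypothetical scores, for illustration only.
scores = {
    "prompt engineering": 0.74,
    "few-shot prompting": 0.81,
    "RAG": 0.82,
    "task decomposition": 0.84,
}
print(fine_tuning_justified(scores, target=0.90, dominant_gap="form"))
print(fine_tuning_justified(scores, target=0.90, dominant_gap="facts"))
```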
Principles
- Fine-tuning is for form. Retrieval is for facts.
- Climb in order. Cheap steps first, hard-to-undo steps last.
- Diagnose before you climb. The wrong step wastes the most expensive lever.
- Record the score at every step, so data justifies the cost of fine-tuning, not team excitement.
Known failure modes (3)
- ✕Naive-RAG-First
Skipping prompt engineering and reaching straight for retrieval, when half the gap was a prompt-clarity issue that retrieval cannot close and fine-tuning won't close either.
- ✕Automating a Broken Process
Fine-tuning to compensate for a broken prompt or a broken retrieval pipeline, which bakes the underlying defect into the weights.
- ✕Demo-to-Production Cliff
Fine-tuning on a tiny in-house set, declaring victory in dev, then watching the model collapse on production-distribution inputs.
Related patterns (9)
- ★★Naive RAG
Condition the generator on top-k chunks retrieved from an external dense index so knowledge lives outside parameters.
- ★★Agentic RAG
Replace static retrieve-then-generate with autonomous agents that plan, choose sources, retrieve iteratively, reflect, and re-query.
- ★Contextual Retrieval
Prepend a short LLM-generated description to each chunk before embedding so the chunk carries its situating context.
- ★★Cross-Encoder Reranking
After cheap bi-encoder or BM25 retrieval, rescore top-N candidates with a cross-encoder that jointly attends over (query, candidate).
- ★★Plan-and-Execute
Plan all the steps once with a strong model, then execute each step with a cheaper model under the plan.
- ★★Prompt Chaining
Decompose a task into a fixed sequence of LLM calls where each step's output becomes the next step's input.
- ★★Query-Decomposition Agent
An agent whose explicit job is to split an incoming user query into smaller independent sub-queries that can be answered sequentially or in parallel, then merge results.
- ★★Structured Output
Constrain the model's output to conform to a JSON Schema (or similar typed shape).
- ★★Prompt Versioning
Treat prompts as immutable, hashed, semver'd artefacts in a registry; deploy and roll back like code.
Related compositions (2)
- recipe · abstract shape: Production LLM Platform
Stand up a production LLM/RAG system whose data pipeline, model pipeline, and inference path scale and deploy independently.
- recipe · abstract shape: Production RAG
Retrieval-grounded generation built to be defensible: hybrid retrieval, reranking, contextualised chunks, citations rendered to the user, and verification before the answer ships.
Related methodologies (3)
- Tools-First, Then RAG★
Check what shape your knowledge is in before you choose search, then pick the simplest way to reach each source.
- Evaluation-Driven Development★★
Judge every prompt change, model swap, search tweak, and new tool against a test you committed to up front, not by feel.
- Model Selection Workflow★★
Turn model selection into a repeatable four-step routine. The output is a private leaderboard and a live monitor, not a one-time decision.
Sources (2)
Provenance
- Added to catalog:
- Last updated:
- Verification status: verified