Methodology · LLM-App Engineeringprovenverified

Finetune-as-Last-Resort Escalation

also known as fine-tuning escalation ladder, exhaust prompting first

Applies to: llm-appagentrag-systemcoding-agent

Tags: finetuningescalationrag-vs-finetune

Treat fine-tuning as the last step on a ladder, not the first lever you grab. The rule of thumb: fine-tuning is for form, retrieval is for facts. Climb the ladder in order. First prompt engineering, then few-shot prompting, then retrieval-augmented generation (RAG), then advanced RAG, then splitting the task or using an agent to run the steps. Only reach for fine-tuning once those steps run out. Every step before it is cheap to undo. Fine-tuning is not.

Methodology process overview

Intent. Make teams use up prompt engineering, retrieval, and task splitting before they fine-tune, because fine-tuning is the most expensive and the hardest to undo.

When to apply. Use this whenever a team is thinking about fine-tuning a model for a specific behaviour or quality. Apply it as a gate: have we tried the cheaper steps first? Don't apply it when fine-tuning is clearly the right tool, such as teaching a fixed response format, adding a new input type, or shrinking a large model into a small one. Also skip the ladder when the data clearly shows prompt engineering has stalled and the gap that remains is about form.

Example scenario

A legaltech team was building a contract-clause classifier. It hit 78% accuracy on the test set and needed 92% to ship. The first instinct was to fine-tune a base model on their 5,000 labelled clauses. Instead, the team lead ran the gap diagnosis. Half the failures were form errors, where the model wrote prose when JSON was expected. A quarter were fact errors, where the model did not know jurisdiction-specific clauses. The last quarter were real classification errors. Rung 1 was a tighter system prompt that forced structured output. It closed the form errors and lifted accuracy to 86%. Step 2 added six hand-picked few-shot examples covering the trickiest clause types and pushed accuracy to 89%. Step 3 added retrieval over the firm's jurisdiction handbook for citation grounding. That lifted accuracy to 91.5% and removed the fact errors. The team paused before step 4 and saw they were within noise of the target. They shipped without fine-tuning. The escalation log showed each step's score, the total cost of 4 days of engineering and zero GPU spend, and a written decision to revisit only if accuracy dropped below 90 in production. Six months later it held. The retrospective was clear: had they jumped straight to fine-tuning, they would have spent four weeks and baked the form defect into the weights.

Inputs

Current quality measurement — The score on your test set with the current setup, meaning the current model, prompt, and retrieval.
Quality target — The score the system must reach to ship or to meet a promise made to stakeholders.
Gap diagnosis — A breakdown of how the current system fails, sorted into facts, form, reasoning, retrieval, and tool use. This is how you pick the right step.

Outputs

Escalation log — An ordered record of which steps you tried, the score each one produced, and why you needed the next step.
Selected lever — The lowest step that closes the gap to the quality target.

Steps (6)

Diagnose the gap before climbing
Run the test set and look at how it fails. Is the model wrong about facts, which is a retrieval problem? Wrong in form, which is a fine-tuning problem? Wrong in reasoning, which is a task-splitting problem? Or wrong because a tool failed? The diagnosis picks the right step. Without it, you climb in the wrong direction.
Step 1: prompt engineering
Improve the system prompt, the output format, the role you give the model, and the structured output. Measure on the test set. Many gaps close right here. It is cheap and easy to undo.
usesStructured Output Prompt Versioning
Step 2: few-shot prompting
Add a few hand-picked examples in the prompt. They show the format and the reasoning you want. Measure again. This works well for form and style problems.
Step 3: retrieval-augmented generation
If facts are the problem, add retrieval. Start with the simplest setup and measure. Move to advanced retrieval, such as reranking, query rewriting, and layered search, only if the simple version stalls.
usesNaive RAG Agentic RAG Contextual Retrieval Cross-Encoder Reranking
Step 4: task decomposition or agent orchestration
If the task is too big for one call, split it. Use plan-and-execute, prompt chaining, or query decomposition. Each piece then becomes a smaller, easier problem.
usesPlan-and-Execute Prompt Chaining Query-Decomposition Agent
Step 5: fine-tune
Only now. Fine-tune for form, a fixed response shape, or to shrink a big model into a small one. Retrieval is still for facts, so do not try to fine-tune knowledge in. Record the scores from every earlier step. That way the data, not excitement, justifies the cost of fine-tuning.

Framework-specific instructions

Pick a framework and generate a framework-targeted rewrite of this methodology's steps.

Choose framework

AI-generated for Agent Development Kit (ADK) (Google) — verify against official docs.

Principles

Fine-tuning is for form. Retrieval is for facts.
Climb in order. Cheap steps first, hard-to-undo steps last.
Diagnose before you climb. The wrong step wastes the most expensive lever.
Record the score at every step, so data justifies the cost of fine-tuning, not team excitement.

Finetune-as-Last-Resort Escalation

Methodology process overview

Steps (6)

Diagnose the gap before climbing

Step 1: prompt engineering

Step 2: few-shot prompting

Step 3: retrieval-augmented generation

Step 4: task decomposition or agent orchestration

Step 5: fine-tune

Framework-specific instructions

Principles

Known failure modes (3)

Related patterns (9)

Related compositions (2)

Related methodologies (3)

Sources (2)

Provenance