Methodology · LLM-App Engineering · proven · verified

Finetune-as-Last-Resort Escalation

Also known as: fine-tuning escalation ladder, exhaust prompting first

Applies to: llm-app, agent, rag-system, coding-agent

Tags: finetuning, escalation, rag-vs-finetune

Treat fine-tuning as the last step on a ladder, not the first lever you grab. The rule of thumb: fine-tuning is for form, retrieval is for facts. Climb the ladder in order: prompt engineering, then few-shot prompting, then retrieval-augmented generation (RAG), then advanced RAG, then splitting the task or using an agent to run the steps. Reach for fine-tuning only once those steps run out. Every step before it is cheap to undo. Fine-tuning is not.

[Figure: Methodology process overview]

Intent. Make teams exhaust prompt engineering, retrieval, and task splitting before they fine-tune, because fine-tuning is the most expensive step and the hardest to undo.

When to apply. Use this whenever a team is thinking about fine-tuning a model to get a specific behaviour or quality level. Apply it as a gate: have we tried the cheaper steps first? Don't apply it when fine-tuning is clearly the right tool, such as teaching a fixed response format, adding a new input type, or distilling a large model into a small one. Also skip the ladder when the data clearly shows prompt engineering has stalled and the gap that remains is about form.
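
The gate described above can be sketched as a small decision function. This is a minimal illustration, not a prescribed implementation: `Proposal`, `finetune_gate`, the rung names, and the use-case labels are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field

# Steps below fine-tuning, in climbing order (illustrative names).
CHEAPER_RUNGS = ("prompt_engineering", "few_shot", "rag", "decomposition")

# Use cases the text treats as clear fine-tuning territory (hypothetical labels).
DIRECT_FINETUNE_CASES = {"fixed_response_format", "new_input_type", "distillation"}


@dataclass
class Proposal:
    use_case: str                                   # what fine-tuning is wanted for
    rungs_tried: set = field(default_factory=set)   # cheaper steps already measured
    prompting_stalled: bool = False                 # scores show prompt work has plateaued
    remaining_gap: str = ""                         # dominant failure category still open


def finetune_gate(p: Proposal) -> tuple:
    """Return (allowed, reason) for a fine-tuning proposal."""
    if p.use_case in DIRECT_FINETUNE_CASES:
        return True, "fine-tuning is clearly the right tool for this use case"
    if p.prompting_stalled and p.remaining_gap == "form":
        return True, "prompting has stalled and the remaining gap is about form"
    missing = [r for r in CHEAPER_RUNGS if r not in p.rungs_tried]
    if missing:
        return False, "try cheaper steps first: " + ", ".join(missing)
    return True, "all cheaper steps were tried"
```

A proposal that has only tried prompt engineering is sent back with the list of untried steps, while a distillation proposal passes straight through.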

Inputs

  • Current quality measurement: The score on your test set with the current setup, meaning the current model, prompt, and retrieval.
  • Quality target: The score the system must reach to ship or to meet a promise made to stakeholders.
  • Gap diagnosis: A breakdown of how the current system fails, sorted into facts, form, reasoning, retrieval, and tool use. This is how you pick the right step.
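
One way to turn these three inputs into a starting point is to tally labelled test-set failures and look up the dominant category. The sketch below assumes each failing case has already been hand-labelled with one of the five categories. The function name and the mappings for `retrieval` and `tool_use` are assumptions made for illustration; the other mappings follow the text.

```python
from collections import Counter

# Failure category -> step that addresses it.
STEP_FOR_CATEGORY = {
    "facts": "retrieval-augmented generation",
    "form": "few-shot prompting, then fine-tuning if that stalls",
    "reasoning": "task decomposition",
    "retrieval": "advanced retrieval",        # assumption, not stated in the text
    "tool_use": "fix the tool integration",   # assumption, not stated in the text
}


def diagnose_gap(failure_labels, current_score, target_score):
    """Summarise the gap to target and the dominant failure category."""
    unknown = set(failure_labels) - set(STEP_FOR_CATEGORY)
    if unknown:
        raise ValueError(f"unknown failure categories: {sorted(unknown)}")
    counts = Counter(failure_labels)
    # most_common(1) is empty when there are no failures at all.
    dominant = counts.most_common(1)[0][0] if counts else None
    return {
        "gap": target_score - current_score,
        "counts": dict(counts),
        "dominant": dominant,
        "suggested_step": STEP_FOR_CATEGORY.get(dominant),
    }
```

For example, `diagnose_gap(["facts", "facts", "form"], 0.70, 0.85)` reports `facts` as the dominant category and suggests retrieval rather than fine-tuning.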

Outputs

  • Escalation log: An ordered record of which steps you tried, the score each one produced, and why you needed the next step.
  • Selected lever: The lowest step that closes the gap to the quality target.
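
The two outputs can be represented with a very small data structure. The sketch below is one possible shape, with hypothetical names (`LogEntry`, `selected_lever`); the scores in the example log are made-up numbers for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LogEntry:
    order: int       # position on the ladder, 1 = cheapest
    step: str        # which step was tried
    score: float     # test-set score after applying this step
    why_next: str    # why the next step was needed ("" if the target was met)


def selected_lever(log: List[LogEntry], target: float) -> Optional[LogEntry]:
    """Return the lowest step whose score meets the target, or None."""
    meeting = [entry for entry in log if entry.score >= target]
    return min(meeting, key=lambda entry: entry.order) if meeting else None


# Illustrative log (invented scores, not measurements).
example_log = [
    LogEntry(1, "prompt engineering", 0.71, "facts still wrong on recent items"),
    LogEntry(2, "few-shot prompting", 0.74, "format fixed, facts still wrong"),
    LogEntry(3, "retrieval-augmented generation", 0.88, ""),
]
```

With a target of 0.85, `selected_lever(example_log, 0.85)` returns the retrieval entry, and the `why_next` fields document why the two cheaper steps were not enough.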

Steps (6)

  1. Diagnose the gap before climbing

    Run the test set and look at how it fails. Is the model wrong about facts, which is a retrieval problem? Wrong in form, which few-shot examples may fix and fine-tuning addresses if they do not? Wrong in reasoning, which is a task-splitting problem? Or wrong because a tool failed? The diagnosis picks the right step. Without it, you climb in the wrong direction.

  2. Step 1: prompt engineering

    Improve the system prompt, the output format, the role you give the model, and the structured output. Measure on the test set. Many gaps close right here. It is cheap and easy to undo.

    uses: Structured Output, Prompt Versioning

  3. Step 2: few-shot prompting

    Add a few hand-picked examples in the prompt. They show the format and the reasoning you want. Measure again. This works well for form and style problems.

  4. Step 3: retrieval-augmented generation

    If facts are the problem, add retrieval. Start with the simplest setup and measure. Move to advanced retrieval, such as reranking, query rewriting, and layered search, only if the simple version stalls.

    uses: Naive RAG, Agentic RAG, Contextual Retrieval, Cross-Encoder Reranking

  5. Step 4: task decomposition or agent orchestration

    If the task is too big for one call, split it. Use plan-and-execute, prompt chaining, or query decomposition. Each piece then becomes a smaller, easier problem.

    uses: Plan-and-Execute, Prompt Chaining, Query-Decomposition Agent

  6. Step 5: fine-tune

    Only now. Fine-tune for form, a fixed response shape, or to shrink a big model into a small one. Retrieval is still for facts, so do not try to fine-tune knowledge in. Record the scores from every earlier step. That way the data, not excitement, justifies the cost of fine-tuning.
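
The climb through the steps above can be sketched as a loop that applies one step at a time, measures, and stops as soon as the target is met. This is a minimal sketch under stated assumptions: `evaluate` is a hypothetical callable that applies a step to the system and returns the test-set score, and the `LADDER` names are shorthand for the steps in the text.

```python
from typing import Callable, List, Tuple

# Steps in climbing order, cheapest and most reversible first.
LADDER = [
    "prompt engineering",
    "few-shot prompting",
    "retrieval-augmented generation",
    "task decomposition or agent orchestration",
    "fine-tune",
]


def climb(evaluate: Callable[[str], float], target: float) -> List[Tuple[str, float]]:
    """Try steps in order and stop at the first one that meets the target.

    Returns the (step, score) pairs recorded along the way, so the scores
    of every earlier step are available if fine-tuning is reached.
    """
    log = []
    for step in LADDER:
        score = evaluate(step)  # hypothetical: apply the step, run the test set
        log.append((step, score))
        if score >= target:
            break
    return log
```

If retrieval already meets the target, the loop never reaches the fine-tuning step. If it does reach it, the returned log holds the scores that justify the cost.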

Principles

  • Fine-tuning is for form. Retrieval is for facts.
  • Climb in order. Cheap steps first, hard-to-undo steps last.
  • Diagnose before you climb. The wrong step wastes the most expensive lever.
  • Record the score at every step, so data justifies the cost of fine-tuning, not team excitement.

Provenance

  • Added to catalog:
  • Last updated:
  • Verification status: verified