Methodology · LLM-App Engineering · proven · verified

Model Selection Workflow

Also known as: four-phase model selection, private leaderboard build

Applies to: llm-app, agent, rag-system, coding-agent

Tags: model-selection, private-leaderboard, evaluation

Pick a foundation model in four phases, in order. First, throw out any model that breaks a hard rule, such as licence, data location, or the kinds of input it handles. Second, narrow the rest using public benchmarks and leaderboards. Third, test the survivors on your real task, rank them in your own private leaderboard, and choose from that ranking. Fourth, keep watching the chosen model in production, using the same scores, so you catch when it slips. The thing you keep is the private leaderboard, not a one-off 'we picked GPT-something' call.

Methodology process overview

Intent. Turn model selection into a repeatable four-step routine. The output is a private leaderboard and a live monitor, not a one-time decision.

When to apply. Use this when you start any LLM application where the model really matters, such as chat, retrieval-augmented generation (RAG), an agent, classification, or text generation. Run it before you lock down your first prompt. Run it again when a strong new model ships or when production numbers move. Don't apply it when an outside rule already forces one model, such as a regulation or a single-vendor contract. Skip it for throwaway prototypes that last under a day. For those, just use the cheapest model that might work.

Inputs

  • Use case specification: What the model must do, written as a task plus a quality bar.
  • Hard constraints: Rules you cannot bend. These include where data may live, licensing, what input types the model takes, the slowest speed you accept, and the most you will pay per call.
  • Public benchmarks and leaderboards: Public evidence about each candidate model, such as provider docs, papers, and outside leaderboards.
  • Application eval set: Test inputs and a scoring guide built for your real task. Do not use generic benchmarks here.
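
The hard constraints above are easier to apply once they are written as data. Here is a minimal sketch of that filter. Every name, field, and value in it (`Candidate`, `HardConstraints`, `model-a`, the licence strings, the limits) is a made-up illustration, not real model data or part of this methodology's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Hypothetical fields; fill them from provider docs for real models.
    name: str
    licence: str
    regions: frozenset
    modalities: frozenset
    context_tokens: int
    p95_latency_s: float
    cost_per_call_usd: float


@dataclass(frozen=True)
class HardConstraints:
    allowed_licences: frozenset
    required_region: str
    required_modalities: frozenset
    min_context_tokens: int
    max_p95_latency_s: float
    max_cost_per_call_usd: float


def violations(c: Candidate, hc: HardConstraints) -> list:
    """Return the name of every hard rule the candidate breaks."""
    broken = []
    if c.licence not in hc.allowed_licences:
        broken.append("licence")
    if hc.required_region not in c.regions:
        broken.append("data location")
    if not hc.required_modalities <= c.modalities:
        broken.append("input types")
    if c.context_tokens < hc.min_context_tokens:
        broken.append("context length")
    if c.p95_latency_s > hc.max_p95_latency_s:
        broken.append("speed")
    if c.cost_per_call_usd > hc.max_cost_per_call_usd:
        broken.append("cost")
    return broken


def apply_filter(candidates: list, hc: HardConstraints) -> list:
    """Keep only candidates that break no hard rule."""
    return [c for c in candidates if not violations(c, hc)]


# Illustrative candidates and rules only.
CANDIDATES = [
    Candidate("model-a", "apache-2.0", frozenset({"eu", "us"}),
              frozenset({"text", "image"}), 128_000, 1.2, 0.004),
    Candidate("model-b", "research-only", frozenset({"eu"}),
              frozenset({"text"}), 32_000, 0.8, 0.001),
    Candidate("model-c", "apache-2.0", frozenset({"us"}),
              frozenset({"text", "image"}), 200_000, 3.5, 0.020),
]
RULES = HardConstraints(
    allowed_licences=frozenset({"apache-2.0", "mit"}),
    required_region="eu",
    required_modalities=frozenset({"text", "image"}),
    min_context_tokens=64_000,
    max_p95_latency_s=2.0,
    max_cost_per_call_usd=0.01,
)

for c in CANDIDATES:
    print(c.name, violations(c, RULES) or "passes")
```

Returning the list of broken rules, instead of a bare pass or fail, gives you the written reason for each rejection at no extra cost.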

Outputs

  • Private leaderboard: A ranked list of candidate models, scored on your real test set, kept under version control in the repo.
  • Selection record: A written reason for the pick. It ties the chosen model back to the hard rules and the leaderboard scores.
  • Production monitor: A live pipeline that watches the chosen model and flags when its quality or cost drifts away from the leaderboard baseline.
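
A private leaderboard can start as a small script whose output you commit to the repo. The sketch below is one possible shape, under loud assumptions: the two "models" are stub functions standing in for real API clients, `exact_match` stands in for your scoring guide, and the case ids, costs, and field names are all hypothetical.

```python
import json
import time
from statistics import mean


def exact_match(case: dict, output: str) -> float:
    """Stand-in scoring guide: 1.0 for an exact match, else 0.0."""
    return 1.0 if output == case["expected"] else 0.0


def build_leaderboard(models: dict, eval_set: list, score, cost_per_call: dict) -> list:
    """Run every model on every case with the same scorer, then rank by score."""
    rows = []
    for name, generate in models.items():
        scores, latencies, failed_ids = [], [], []
        for case in eval_set:
            start = time.perf_counter()
            output = generate(case["input"])
            latencies.append(time.perf_counter() - start)
            s = score(case, output)
            scores.append(s)
            if s < 1.0:
                failed_ids.append(case["id"])  # keep failures for later review
        rows.append({
            "model": name,
            "mean_score": round(mean(scores), 3),
            "cost_per_call_usd": cost_per_call[name],
            "mean_latency_s": mean(latencies),
            "failed_case_ids": failed_ids,
        })
    rows.sort(key=lambda r: r["mean_score"], reverse=True)
    return rows


# Illustrative eval set and stub models; replace with your task and real clients.
EVAL_SET = [
    {"id": "c1", "input": "ok", "expected": "OK"},
    {"id": "c2", "input": "hi", "expected": "HI"},
    {"id": "c3", "input": "YES", "expected": "YES"},
]
MODELS = {
    "stub-echo": lambda text: text,
    "stub-upper": lambda text: text.upper(),
}
COSTS = {"stub-echo": 0.001, "stub-upper": 0.002}  # assumed, not measured

LEADERBOARD = build_leaderboard(MODELS, EVAL_SET, exact_match, COSTS)
print(json.dumps(LEADERBOARD, indent=2))
```

The JSON output is what goes under version control. Sorting here uses score alone; the final pick still weighs cost and speed from the same rows.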

Steps (5)

  1. Apply the hard-constraint filter

    List the rules you cannot bend. These cover where data may live, licensing, what input types the model takes, context length, output size, the slowest speed you accept, and the most you will pay. Drop every model that breaks any rule. Teams waste time when these rules stay in their heads. Write them down and the candidate list shrinks fast.

  2. Screen on public information

    For the models that survived, read provider model cards, outside leaderboards such as LMSYS Chatbot Arena, HELM, and the Hugging Face Open LLM Leaderboard, and recent papers. Build a shortlist of three to five models that the public evidence says could work. Public benchmarks are a weak signal. They narrow the field. They do not pick the winner.

  3. Build the private leaderboard

    Run each shortlisted model on your real test set. Use the same scoring guide for every model. Record each model's score, cost per call, speed, and the ways it fails. This leaderboard is the thing you keep, not a one-time analysis.

    uses: Dimensional Synthetic Eval Set, Frozen Rubric Reflection

  4. Select and document

    Pick the model whose score, cost, and speed best fit your task. Write down why it beat the runners-up. Pin the versions of the test set and scoring guide you used, so anyone can repeat the decision.

  5. Monitor in production

    Wire the live system to record the same numbers. Track quality against a small live test slice, cost per call, and speed. When a number drifts past your threshold, start the workflow again.

    uses: Cost Observability, Scorer Live Monitoring
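
The drift check in the last step can be sketched as a comparison between the leaderboard baseline and a recent window of live records. This is an illustration under stated assumptions: the `Baseline` and `Thresholds` types, the record keys, and every number are hypothetical, and the choice of an absolute limit for quality but relative limits for cost and speed is one option among several.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Baseline:
    # Values copied from the chosen model's leaderboard row (assumed nonzero).
    quality: float
    cost_per_call_usd: float
    latency_s: float


@dataclass(frozen=True)
class Thresholds:
    max_quality_drop: float      # absolute drop in score, e.g. 0.05
    max_cost_increase: float     # relative increase, e.g. 0.20 means 20%
    max_latency_increase: float  # relative increase


def drift_alerts(baseline: Baseline, window: list, limits: Thresholds) -> list:
    """Return the metrics whose recent average drifted past its threshold."""
    if not window:
        return []  # nothing observed yet, so nothing to flag
    quality = mean(r["quality"] for r in window)
    cost = mean(r["cost_usd"] for r in window)
    latency = mean(r["latency_s"] for r in window)

    alerts = []
    if baseline.quality - quality > limits.max_quality_drop:
        alerts.append("quality")
    if (cost - baseline.cost_per_call_usd) / baseline.cost_per_call_usd > limits.max_cost_increase:
        alerts.append("cost")
    if (latency - baseline.latency_s) / baseline.latency_s > limits.max_latency_increase:
        alerts.append("latency")
    return alerts


# Illustrative numbers only.
BASELINE = Baseline(quality=0.90, cost_per_call_usd=0.010, latency_s=1.0)
LIMITS = Thresholds(max_quality_drop=0.05, max_cost_increase=0.20, max_latency_increase=0.25)
WINDOW = [
    {"quality": 0.8, "cost_usd": 0.010, "latency_s": 1.0},
    {"quality": 0.7, "cost_usd": 0.011, "latency_s": 2.0},
]

ALERTS = drift_alerts(BASELINE, WINDOW, LIMITS)
if ALERTS:
    print("Drift detected, restart the workflow:", ALERTS)
```

A non-empty alert list is the trigger to start the workflow again. The `quality` values must come from the same scoring guide as the leaderboard, or the comparison with the baseline means little.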

Principles

  • The private leaderboard is the thing you keep. Public benchmarks only make the shortlist.
  • Hard rules are filters, not preferences. Apply them first to shrink the candidate list.
  • The pick is not one and done. Watching the model in production is part of the work.
  • Score every model on the same guide and the same test set, or the comparison is just for show.
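
One way to enforce the last principle is to stamp each leaderboard row with a fingerprint of the test set and scoring guide it was run against. The sketch below assumes both can be serialised to JSON. The helper names and the `eval_set_fp` and `rubric_fp` fields are hypothetical, and SHA-256 over canonical JSON is a choice made for this example, not something the methodology prescribes.

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """Short, stable hash of any JSON-serialisable object."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def stamp(row: dict, eval_set: list, rubric: dict) -> dict:
    """Return a copy of a leaderboard row with the versions it was scored on."""
    return {**row, "eval_set_fp": fingerprint(eval_set), "rubric_fp": fingerprint(rubric)}


def comparable(rows: list) -> bool:
    """True only if every row used the same test set and scoring guide."""
    return len({(r["eval_set_fp"], r["rubric_fp"]) for r in rows}) <= 1


# Illustrative data only.
EVAL_SET_V1 = [{"id": "c1", "input": "ok", "expected": "OK"}]
EVAL_SET_V2 = EVAL_SET_V1 + [{"id": "c2", "input": "hi", "expected": "HI"}]
RUBRIC = {"name": "exact-match", "version": 1}

row_a = stamp({"model": "model-a", "mean_score": 0.91}, EVAL_SET_V1, RUBRIC)
row_b = stamp({"model": "model-b", "mean_score": 0.88}, EVAL_SET_V1, RUBRIC)
row_c = stamp({"model": "model-c", "mean_score": 0.95}, EVAL_SET_V2, RUBRIC)

print(comparable([row_a, row_b]))  # same versions
print(comparable([row_a, row_c]))  # different test set
```

The same fingerprints can go into the selection record, so the pinned versions behind a decision are explicit.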

Provenance

  • Added to catalog:
  • Last updated:
  • Verification status: verified