Model Selection Workflow
also known as four-phase model selection, private leaderboard build
Pick a foundation model in four steps, in order. First, throw out any model that breaks a hard rule, such as licence, data location, or what kinds of input it handles. Second, narrow the rest using public benchmarks and leaderboards. Third, test the survivors on your real task and rank them in your own private leaderboard. Fourth, keep watching the chosen model in production, using the same scores, so you catch when it slips. The thing you keep is the private leaderboard, not a one-off 'we picked GPT-something' call.
Methodology process overview
Intent. Turn model selection into a repeatable four-step routine. The output is a private leaderboard and a live monitor, not a one-time decision.
When to apply. Use this when you start any LLM application where the model really matters, such as chat, retrieval-augmented generation (RAG), an agent, classification, or text generation. Run it before you lock down your first prompt. Run it again when a strong new model ships or when production numbers move. Don't apply it when an outside rule already forces one model, such as a regulation or a single-vendor contract. Skip it for throwaway prototypes that last under a day. For those, just use the cheapest model that might work.
Inputs
- Use case specification — What the model must do, written as a task plus a quality bar.
- Hard constraints — Rules you cannot bend. These include where data may live, licensing, what input types the model takes, the slowest speed you accept, and the most you will pay per call.
- Public benchmarks and leaderboards — Public evidence about each candidate model, such as provider docs, papers, and outside leaderboards.
- Application eval set — Test inputs and a scoring guide built for your real task. Do not use generic benchmarks here.
Outputs
- Private leaderboard — A ranked list of candidate models, scored on your real test set, kept under version control in the repo.
- Selection record — A written reason for the pick. It ties the chosen model back to the hard rules and the leaderboard scores.
- Production monitor — A live pipeline that watches the chosen model and flags when its quality or cost drifts away from the leaderboard baseline.
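One way to keep the private leaderboard under version control is to store it as a small, deterministic JSON file so that diffs stay readable. The sketch below is an assumption about layout, not a format the methodology prescribes. The field names, the `render_leaderboard` helper, and the model names are hypothetical, and the numbers are placeholders rather than measured results.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class LeaderboardRow:
    model: str                # hypothetical model identifier
    score: float              # mean score on the application eval set
    cost_per_call_usd: float  # measured cost per call
    p95_latency_s: float      # measured 95th-percentile latency


def render_leaderboard(rows, eval_set_version, rubric_version):
    """Return a deterministic JSON document with the best score first."""
    # Break ties on lower cost, then on name, so the order is stable.
    ranked = sorted(rows, key=lambda r: (-r.score, r.cost_per_call_usd, r.model))
    doc = {
        "eval_set_version": eval_set_version,
        "rubric_version": rubric_version,
        "rows": [asdict(r) for r in ranked],
    }
    return json.dumps(doc, indent=2, sort_keys=True)


# Placeholder rows for illustration only.
rows = [
    LeaderboardRow("model-a", 0.81, 0.004, 1.9),
    LeaderboardRow("model-b", 0.86, 0.012, 2.4),
]
text = render_leaderboard(rows, eval_set_version="v3", rubric_version="v1")
```

Committing `text` to a file in the repo gives reviewers a diff each time a model is added or the test set changes.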
Steps (5)
Apply the hard-constraint filter
List the rules you cannot bend. These cover where data may live, licensing, what input types the model takes, context length, output size, the slowest speed you accept, and the most you will pay. Drop every model that breaks any rule. Teams waste time when these rules stay in their heads. Write them down and the candidate list shrinks fast.
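Writing the rules down can be as simple as encoding them in a small data structure and checking every candidate against it. The following is a minimal sketch under stated assumptions: the `Candidate` and `HardConstraints` types, their field names, and the example values are all hypothetical, and output size is left out for brevity. It also records which rule each dropped model broke, so the filter is auditable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    license: str
    regions: frozenset       # regions where data may be processed
    modalities: frozenset    # input types the model accepts
    context_tokens: int
    p95_latency_s: float
    cost_per_call_usd: float


@dataclass(frozen=True)
class HardConstraints:
    allowed_licenses: frozenset
    required_region: str
    required_modalities: frozenset
    min_context_tokens: int
    max_p95_latency_s: float
    max_cost_per_call_usd: float


def violations(c, rules):
    """Return the names of the rules a candidate breaks (empty means it passes)."""
    found = []
    if c.license not in rules.allowed_licenses:
        found.append("license")
    if rules.required_region not in c.regions:
        found.append("data residency")
    if not rules.required_modalities <= c.modalities:
        found.append("modalities")
    if c.context_tokens < rules.min_context_tokens:
        found.append("context length")
    if c.p95_latency_s > rules.max_p95_latency_s:
        found.append("latency")
    if c.cost_per_call_usd > rules.max_cost_per_call_usd:
        found.append("cost")
    return found


def apply_filter(candidates, rules):
    """Split candidates into survivors and a name -> broken-rules map."""
    survivors, dropped = [], {}
    for c in candidates:
        broken = violations(c, rules)
        if broken:
            dropped[c.name] = broken
        else:
            survivors.append(c)
    return survivors, dropped


# Placeholder rules and candidates for illustration only.
rules = HardConstraints(
    allowed_licenses=frozenset({"apache-2.0", "commercial"}),
    required_region="eu",
    required_modalities=frozenset({"text"}),
    min_context_tokens=32_000,
    max_p95_latency_s=3.0,
    max_cost_per_call_usd=0.02,
)
candidates = [
    Candidate("model-a", "apache-2.0", frozenset({"eu", "us"}),
              frozenset({"text"}), 128_000, 1.8, 0.004),
    Candidate("model-b", "research-only", frozenset({"us"}),
              frozenset({"text", "image"}), 200_000, 2.5, 0.015),
]
survivors, dropped = apply_filter(candidates, rules)
```

Here `model-b` is dropped for its licence and data residency, and the `dropped` map keeps that reason on record.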
Screen on public information
For the models that survived, read provider model cards, outside leaderboards such as LMSYS Chatbot Arena, HELM, and the Hugging Face Open LLM Leaderboard, and recent papers. Build a shortlist of three to five models that the public evidence says could work. Public benchmarks are a weak signal. They narrow the field. They do not pick the winner.
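If the public scores are collected by hand into a table, a crude aggregate is enough to cut the field, because this step only shortlists. The sketch below ranks models by mean rank across benchmarks. That choice is an assumption made for illustration and is not part of the methodology. The `shortlist` function, the model and benchmark names, and the scores are hypothetical placeholders, not real results.

```python
def shortlist(public_scores, k=5):
    """Return up to k model names ordered by mean rank across benchmarks.

    public_scores maps model -> {benchmark: score}, where higher is better.
    A model with no score on a benchmark ranks last on that benchmark.
    """
    benchmarks = sorted({b for scores in public_scores.values() for b in scores})
    if not benchmarks:
        return sorted(public_scores)[:k]

    rank_sum = {m: 0 for m in public_scores}
    for b in benchmarks:
        ordered = sorted(
            public_scores,
            key=lambda m: public_scores[m].get(b, float("-inf")),
            reverse=True,
        )
        for rank, m in enumerate(ordered, start=1):
            rank_sum[m] += rank

    mean_rank = {m: rank_sum[m] / len(benchmarks) for m in public_scores}
    # Lower mean rank is better; break ties on name for a stable order.
    return sorted(mean_rank, key=lambda m: (mean_rank[m], m))[:k]


# Placeholder scores for illustration only.
public = {
    "model-a": {"bench-x": 70.0, "bench-y": 55.0},
    "model-b": {"bench-x": 65.0, "bench-y": 60.0},
    "model-c": {"bench-x": 40.0, "bench-y": 30.0},
    "model-d": {"bench-x": 72.0},
}
top3 = shortlist(public, k=3)
```

The output is only a shortlist. The order within it carries little weight, which is why the next step re-ranks every model on your own test set.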
Build the private leaderboard
Run each shortlisted model on your real test set. Use the same scoring guide for every model. Record each model's score, cost per call, speed, and the ways it fails. This leaderboard is the thing you keep, not a one-time analysis.
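The loop behind a private leaderboard can be small. The sketch below assumes each model is wrapped as a plain Python callable and that scoring is a function of the output and the expected answer. Real setups would call a provider SDK and often use a rubric or a judge instead. The names `evaluate`, `build_leaderboard`, and `exact_match` are hypothetical, and the stub models stand in for real ones.

```python
import time
from statistics import mean


def exact_match(output, expected):
    """Toy scoring guide: 1.0 for an exact match, otherwise 0.0."""
    return 1.0 if output == expected else 0.0


def evaluate(model_fn, eval_set, score_fn, cost_per_call_usd):
    """Run one model over the eval set and collect score, speed, and failures."""
    scores, latencies, failures = [], [], []
    for case in eval_set:
        start = time.perf_counter()
        output = model_fn(case["input"])
        latencies.append(time.perf_counter() - start)
        score = score_fn(output, case["expected"])
        scores.append(score)
        if score < 1.0:
            # Keep failing cases so failure modes can be reviewed later.
            failures.append({"id": case["id"], "output": output})
    return {
        "score": mean(scores),
        "mean_latency_s": mean(latencies),
        "cost_per_call_usd": cost_per_call_usd,
        "failures": failures,
    }


def build_leaderboard(models, eval_set, score_fn):
    """models maps name -> (callable, cost per call). Best score comes first."""
    rows = []
    for name, (model_fn, cost) in models.items():
        row = evaluate(model_fn, eval_set, score_fn, cost)
        row["model"] = name
        rows.append(row)
    return sorted(rows, key=lambda r: r["score"], reverse=True)


# Stub models and a two-case eval set for illustration only.
eval_set = [
    {"id": "c1", "input": "hello", "expected": "HELLO"},
    {"id": "c2", "input": "ok", "expected": "OK"},
]
models = {
    "stub-echo": (lambda s: s, 0.001),
    "stub-upper": (str.upper, 0.002),
}
leaderboard = build_leaderboard(models, eval_set, exact_match)
```

Every model runs through the same `eval_set` and the same `score_fn`, which is what makes the rows comparable.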
Select and document
Pick the model whose score, cost, and speed best fit your task. Write down why it beat the runners-up. Pin the versions of the test set and scoring guide you used, so anyone can repeat the decision.
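One way to pin the test set and scoring guide is to store a content hash of each alongside the decision. The sketch below assumes both artefacts are available as bytes, for example read from files in the repo. The `selection_record` helper, its field names, and the example values are hypothetical.

```python
import hashlib


def fingerprint(payload: bytes) -> str:
    """Return the SHA-256 hex digest of an artefact's exact bytes."""
    return hashlib.sha256(payload).hexdigest()


def selection_record(leaderboard, chosen, reason, eval_set_bytes, rubric_bytes):
    """Build a record that ties the pick to the scores and pinned artefacts."""
    names = [row["model"] for row in leaderboard]
    if chosen not in names:
        raise ValueError(f"{chosen!r} is not on the leaderboard")
    return {
        "chosen_model": chosen,
        "runners_up": [n for n in names if n != chosen],
        "reason": reason,
        "eval_set_sha256": fingerprint(eval_set_bytes),
        "rubric_sha256": fingerprint(rubric_bytes),
        "leaderboard": leaderboard,
    }


# Placeholder leaderboard and artefacts for illustration only.
leaderboard = [
    {"model": "model-b", "score": 0.86, "cost_per_call_usd": 0.012},
    {"model": "model-a", "score": 0.81, "cost_per_call_usd": 0.004},
]
record = selection_record(
    leaderboard,
    chosen="model-a",
    reason="Within 0.05 of the top score at a third of the cost per call.",
    eval_set_bytes=b'{"id": "c1", "input": "hello", "expected": "HELLO"}\n',
    rubric_bytes=b"Score 1.0 for an exact match, otherwise 0.0.\n",
)
```

If the test set or scoring guide changes by a single byte, the hash changes, so a reader can tell whether a later rerun used the same inputs.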
Monitor in production
Wire the live system to record the same numbers. Track three things: quality on a small live test slice, cost per call, and speed. When any of them drifts past your threshold, start the workflow again.
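The drift check itself is a comparison between the leaderboard baseline and the latest live numbers. The sketch below is an illustration under assumed thresholds: an absolute drop of 0.05 in quality and a relative rise of 20% in cost or latency. The `drift_alerts` function, the metric names, and the numbers are hypothetical placeholders.

```python
def drift_alerts(baseline, live,
                 max_quality_drop=0.05,
                 max_cost_rise=0.20,
                 max_latency_rise=0.20):
    """Return the metrics that drifted past their thresholds.

    Quality uses an absolute drop. Cost and latency use a relative rise.
    """
    alerts = []
    if baseline["score"] - live["score"] > max_quality_drop:
        alerts.append("quality")
    if live["cost_per_call_usd"] > baseline["cost_per_call_usd"] * (1 + max_cost_rise):
        alerts.append("cost")
    if live["p95_latency_s"] > baseline["p95_latency_s"] * (1 + max_latency_rise):
        alerts.append("latency")
    return alerts


# Placeholder numbers for illustration only.
baseline = {"score": 0.86, "cost_per_call_usd": 0.012, "p95_latency_s": 2.4}
live = {"score": 0.78, "cost_per_call_usd": 0.013, "p95_latency_s": 3.0}
alerts = drift_alerts(baseline, live)  # quality and latency have drifted
```

A non-empty result is the trigger to rerun the workflow from the first step.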
Principles
- The private leaderboard is the thing you keep. Public benchmarks only make the shortlist.
- Hard rules are filters, not preferences. Apply them first to shrink the candidate list.
- The pick is not one and done. Watching the model in production is part of the work.
- Score every model on the same guide and the same test set, or the comparison is just for show.
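The last principle can be enforced mechanically by refusing to rank rows that were scored on different versions of the test set or scoring guide. The following is a minimal sketch. The `check_comparable` helper and the version field names are hypothetical.

```python
def check_comparable(rows):
    """Raise ValueError unless every row shares one eval set and one rubric."""
    versions = {(r["eval_set_version"], r["rubric_version"]) for r in rows}
    if len(versions) > 1:
        raise ValueError(f"rows were scored on mixed versions: {sorted(versions)}")
    return True


# Placeholder rows for illustration only.
same = [
    {"model": "model-a", "eval_set_version": "v3", "rubric_version": "v1"},
    {"model": "model-b", "eval_set_version": "v3", "rubric_version": "v1"},
]
mixed = same + [
    {"model": "model-c", "eval_set_version": "v2", "rubric_version": "v1"},
]
ok = check_comparable(same)
```

Calling `check_comparable(mixed)` raises, because `model-c` was scored on an older test set and its score cannot be compared with the others.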
Known failure modes (2)
- Top-Tier Model For Everything (Cost)
Skipping the workflow and reaching for the flagship model on every call. Cost balloons without any measured quality gain.
- Errors Swept Under the Rug
Picking on public benchmarks and never building the private leaderboard. The model's failure modes on the actual use case stay invisible.
Related patterns (5)
- ★★ Multi-Model Routing
Send each request to the cheapest model that can handle it well.
- ★★ Cost Observability
Surface per-request, per-user, and per-feature cost and token consumption to operators in near real time.
- ★ Dimensional Synthetic Eval Set
Generate evaluation inputs not by free-form LLM prompting (which mode-collapses) but by enumerating tuples over explicitly named dimensions and seeding generation from each tuple.
- ★ Frozen Rubric Reflection
Constrain reflection to a fixed, hand-authored rubric of criteria so the reviewer cannot invent new ones each run.
- ★ Scorer Live Monitoring
Score agent outputs asynchronously in production with non-blocking scorers that observe, alert, and log but do not regenerate the output.
Related compositions (2)
- recipe · abstract shape · Production LLM Platform
Stand up a production LLM/RAG system whose data pipeline, model pipeline, and inference path scale and deploy independently.
- recipe · abstract shape · Eval & Observability
How you keep an agent honest in production: harness, judge, decision log, provenance, shadow rollouts.
Related methodologies (2)
- Evaluation-Driven Development ★★
Judge every prompt change, model swap, search tweak, and new tool against a test you committed to up front, not by feel.
- Build-or-Buy Foundation Model Decision ★★
Replace gut-feel calls like 'use OpenAI' or 'self-host Llama' with a seven-factor comparison whose verdicts and weights are written down.
Provenance
- Verification status: verified