Methodology · LLM-App Engineering · proven · verified

Model Selection Workflow

also known as four-phase model selection, private leaderboard build

Applies to: llm-app, agent, rag-system, coding-agent

Tags: model-selection, private-leaderboard, evaluation

Pick a foundation model in four phases, in order. First, throw out any model that breaks a hard rule, such as licence, data location, or supported input types. Second, narrow the rest using public benchmarks and leaderboards. Third, test the survivors on your real task and rank them in your own private leaderboard. Fourth, keep watching the chosen model in production, using the same scores, so you catch when it slips. The five steps listed under Steps split the third phase into building the leaderboard and documenting the pick. The thing you keep is the private leaderboard, not a one-off 'we picked GPT-something' call.

Methodology process overview

flowchart TD
    in1[Use case spec] --> s1[Apply hard-constraint filter]
    in2[Hard constraints] --> s1
    s1 --> s2[Screen on public information]
    in3[Public benchmarks] --> s2
    s2 --> s3[Build private leaderboard]
    in4[Application eval set] --> s3
    s3 --> s4[Select and document]
    s4 --> out1[Selection record]
    s4 --> s5[Monitor in production]
    s5 --> out2[Production monitor]
    s5 -->|drift exceeds threshold| s1
    s3 --> out3[Private leaderboard]

Intent. Turn model selection into a repeatable four-phase routine. The output is a private leaderboard and a live monitor, not a one-time decision.

When to apply. Use this when you start any LLM application where the model really matters, such as chat, retrieval-augmented generation (RAG), an agent, classification, or text generation. Run it before you lock down your first prompt. Run it again when a strong new model ships or when production numbers move. Don't apply it when an outside rule already forces one model, such as a regulation or a single-vendor contract. Skip it for throwaway prototypes that last under a day. For those, just use the cheapest model that might work.

Example scenario

A fintech was building a customer-support copilot for European retail-banking customers and ran this workflow before committing to any model. GDPR and the bank's data residency policy required EU processing, so the hard-constraint filter removed every US-hosted-only API right away. That single filter cut the candidate pool from twelve to four: Mistral Large, self-hosted Llama 3 70B, Claude on AWS Bedrock EU, and a Cohere EU deployment. The team screened these four survivors on public information, and instruction-following and multilingual benchmarks shortlisted three.

The team then built a private leaderboard on 240 real anonymised customer tickets. They scored on a five-axis rubric: accuracy, tone, escalation correctness, refusal calibration, and average cost per resolution. Claude on Bedrock EU won on accuracy and tone. Self-hosted Llama 3 70B was 60% cheaper but lost on refusal calibration. They picked Claude and documented the trade-off.

Then they stood up a live monitor sampling 1% of production traffic against the same rubric. Three months later the monitor flagged a regression after a model version rollover. That kicked the workflow back to step three rather than setting off a panic. The lesson the team kept: the public-benchmark shortlist was useful, but the public information was off on cost per resolution by 2x, because the team's ticket distribution was skewed toward short answers. The private leaderboard was load-bearing.

Inputs

Use case specification — What the model must do, written as a task plus a quality bar.
Hard constraints — Rules you cannot bend. These include where data may live, licensing, what input types the model takes, the slowest speed you accept, and the most you will pay per call.
Public benchmarks and leaderboards — Public evidence about each candidate model, such as provider docs, papers, and outside leaderboards.
Application eval set — Test inputs and a scoring guide built for your real task. Do not use generic benchmarks here.

Outputs

Private leaderboard — A ranked list of candidate models, scored on your real test set, kept under version control in the repo.
Selection record — A written reason for the pick. It ties the chosen model back to the hard rules and the leaderboard scores.
Production monitor — A live pipeline that watches the chosen model and flags when its quality or cost drifts away from the leaderboard baseline.

Steps (5)

Apply the hard-constraint filter
List the rules you cannot bend. These cover where data may live, licensing, what input types the model takes, context length, output size, the slowest speed you accept, and the most you will pay. Drop every model that breaks any rule. Teams waste time when these rules stay in their heads. Write them down and the candidate list shrinks fast.
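The written-down rules can be sketched as predicates over candidate metadata. This is a minimal illustration under assumptions, not a reference implementation: the `Candidate` and `HardConstraints` fields, the thresholds, and the model names are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    regions: frozenset        # regions where the provider can process data
    licence: str
    modalities: frozenset     # input types the model accepts
    context_tokens: int
    p95_latency_s: float
    cost_per_call_usd: float


@dataclass(frozen=True)
class HardConstraints:
    required_region: str
    allowed_licences: frozenset
    required_modalities: frozenset
    min_context_tokens: int
    max_p95_latency_s: float
    max_cost_per_call_usd: float


def violations(c, hc):
    """Return the name of every hard rule the candidate breaks."""
    checks = {
        "data_location": hc.required_region in c.regions,
        "licence": c.licence in hc.allowed_licences,
        "input_types": hc.required_modalities <= c.modalities,
        "context_length": c.context_tokens >= hc.min_context_tokens,
        "latency": c.p95_latency_s <= hc.max_p95_latency_s,
        "cost": c.cost_per_call_usd <= hc.max_cost_per_call_usd,
    }
    return [rule for rule, passed in checks.items() if not passed]


def apply_filter(candidates, hc):
    """Split candidates into survivors and a rejection log (name -> broken rules)."""
    survivors, rejected = [], {}
    for c in candidates:
        broken = violations(c, hc)
        if broken:
            rejected[c.name] = broken
        else:
            survivors.append(c)
    return survivors, rejected


# Hypothetical candidates and limits, for illustration only.
constraints = HardConstraints(
    "eu", frozenset({"commercial"}), frozenset({"text"}), 32_000, 3.0, 0.02
)
candidates = [
    Candidate("model-a", frozenset({"eu", "us"}), "commercial",
              frozenset({"text", "image"}), 128_000, 2.1, 0.012),
    Candidate("model-b", frozenset({"us"}), "commercial",
              frozenset({"text"}), 200_000, 1.4, 0.030),
]
survivors, rejected = apply_filter(candidates, constraints)
```

Keeping the rejection log alongside the survivors gives the selection record a traceable reason for every dropped model.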
Screen on public information
For the models that survived, read provider model cards, outside leaderboards such as LMSYS Chatbot Arena, HELM, and the Hugging Face Open LLM Leaderboard, and recent papers. Build a shortlist of three to five models the public evidence says could work. Public benchmarks are a weak signal. They narrow the field. They do not pick the winner.
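One way to turn public scores into a shortlist is to normalise each benchmark and average. The sketch below assumes that approach; the aggregation rule, the function name `shortlist`, and all scores are hypothetical, and a team may reasonably weight benchmarks differently or shortlist by judgement.

```python
from statistics import mean


def shortlist(public_scores, k=3):
    """Rank models by the mean of min-max normalised benchmark scores; keep the top k.

    public_scores: model name -> {benchmark name -> score, higher is better}.
    A model missing a benchmark is averaged over the benchmarks it does report.
    """
    benchmarks = {b for scores in public_scores.values() for b in scores}
    normalised = {name: [] for name in public_scores}
    for bench in benchmarks:
        reported = {n: s[bench] for n, s in public_scores.items() if bench in s}
        lo, hi = min(reported.values()), max(reported.values())
        for name, value in reported.items():
            # If every model ties on a benchmark, it carries no ranking signal.
            normalised[name].append(0.5 if hi == lo else (value - lo) / (hi - lo))
    ranked = sorted(
        (n for n in normalised if normalised[n]),
        key=lambda n: (-mean(normalised[n]), n),
    )
    return ranked[:k]


# Placeholder numbers, not real benchmark results.
scores = {
    "model-a": {"bench-x": 80, "bench-y": 60},
    "model-b": {"bench-x": 70, "bench-y": 90},
    "model-c": {"bench-x": 60, "bench-y": 50},
    "model-d": {"bench-x": 72},
}
picked = shortlist(scores, k=3)
```

The output is only a shortlist. The ranking within it should not be read as a verdict, since the private leaderboard decides the winner.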
Build the private leaderboard
Run each shortlisted model on your real test set. Use the same scoring guide for every model. Record each model's score, cost per call, speed, and the ways it fails. This leaderboard is the thing you keep, not a one-time analysis.
uses: Dimensional Synthetic Eval Set Frozen Rubric Reflection
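The leaderboard build can be sketched as one loop that runs every model over the same cases with the same scorer. This is an illustration under assumptions: `EvalCase`, `ModelReply`, `LeaderboardRow`, the `exact_match` scorer, and the stub models are hypothetical, and a real scorer would apply the team's rubric rather than exact matching.

```python
import time
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected: str


@dataclass(frozen=True)
class ModelReply:
    text: str
    cost_usd: float


@dataclass
class LeaderboardRow:
    model: str
    mean_score: float
    mean_cost_usd: float
    mean_latency_s: float
    failure_modes: Counter


def build_leaderboard(models, eval_set, score):
    """Run every model on the same cases with the same scorer; rank by mean score.

    models: name -> callable(prompt) -> ModelReply (a hypothetical client wrapper)
    score:  callable(case, reply) -> (score in [0, 1], failure label or None)
    """
    rows = []
    for name, call_model in models.items():
        scores, costs, latencies = [], [], []
        failures = Counter()
        for case in eval_set:
            started = time.perf_counter()
            reply = call_model(case.prompt)
            latencies.append(time.perf_counter() - started)
            value, failure = score(case, reply)
            scores.append(value)
            costs.append(reply.cost_usd)
            if failure is not None:
                failures[failure] += 1
        rows.append(
            LeaderboardRow(name, mean(scores), mean(costs), mean(latencies), failures)
        )
    return sorted(rows, key=lambda row: row.mean_score, reverse=True)


# Stub scorer and stub models so the sketch runs offline.
def exact_match(case, reply):
    ok = reply.text.strip() == case.expected
    return (1.0, None) if ok else (0.0, "wrong_answer")


eval_set = [EvalCase("t1", "2+2?", "4"), EvalCase("t2", "Capital of France?", "Paris")]
answers = {"2+2?": "4", "Capital of France?": "Paris"}
models = {
    "model-a": lambda prompt: ModelReply(answers[prompt], 0.010),
    "model-b": lambda prompt: ModelReply("unsure", 0.004),
}
board = build_leaderboard(models, eval_set, exact_match)
```

Writing `board` to a versioned file in the repo, rather than printing it once, is what makes the leaderboard something the team keeps.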
Select and document
Pick the model whose score, cost, and speed best fit your task. Write down why it beat the runners-up. Pin the versions of the test set and scoring guide you used, so anyone can repeat the decision.
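Pinning can be as simple as storing content hashes of the test set and scoring guide next to the rationale. The sketch below assumes that approach; the record fields, function names, and sample contents are hypothetical, and a team could pin Git commit IDs instead.

```python
import hashlib
import json


def fingerprint(content: bytes) -> str:
    """SHA-256 hex digest, used here as a version pin for a file's contents."""
    return hashlib.sha256(content).hexdigest()


def selection_record(chosen, runners_up, rationale, eval_set_bytes, rubric_bytes):
    """Serialise the decision with pinned eval-set and rubric versions."""
    record = {
        "chosen_model": chosen,
        "runners_up": runners_up,
        "rationale": rationale,
        "eval_set_sha256": fingerprint(eval_set_bytes),
        "rubric_sha256": fingerprint(rubric_bytes),
    }
    return json.dumps(record, indent=2, sort_keys=True)


# Placeholder contents; in practice these would be the bytes of the real files.
eval_set_bytes = b'[{"case_id": "t1", "prompt": "2+2?", "expected": "4"}]'
rubric_bytes = b"accuracy: exact match, 0 or 1"
record = selection_record(
    "model-a",
    ["model-b"],
    "Highest score within the cost ceiling.",
    eval_set_bytes,
    rubric_bytes,
)
```

If either file changes, its hash changes, so a later reader can tell whether they are repeating the decision on the same inputs.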
Monitor in production
Wire the live system to record the same numbers. Track quality against a small live test slice, cost per call, and speed. When a number drifts past your threshold, start the workflow again.
uses: Cost Observability Scorer Live Monitoring
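The drift check can be sketched as a comparison between a window of live samples and the leaderboard baseline. This is a minimal illustration: the `Sample` type, the threshold defaults, and the numbers are hypothetical, and the thresholds in particular should come from the team's own tolerance, not from this sketch.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    quality: float     # rubric score on the live test slice
    cost_usd: float    # cost per call
    latency_s: float   # response time per call


def drift_alerts(baseline, window, max_quality_drop=0.05,
                 max_cost_rise=0.20, max_latency_rise=0.20):
    """Return which metrics drifted past threshold relative to the baseline.

    Quality is compared as an absolute drop; cost and latency as relative rises.
    """
    if not window:
        return []
    alerts = []
    if baseline.quality - mean(s.quality for s in window) > max_quality_drop:
        alerts.append("quality")
    if mean(s.cost_usd for s in window) / baseline.cost_usd - 1 > max_cost_rise:
        alerts.append("cost")
    if mean(s.latency_s for s in window) / baseline.latency_s - 1 > max_latency_rise:
        alerts.append("latency")
    return alerts


# Baseline taken from the leaderboard row of the chosen model (placeholder values).
baseline = Sample(quality=0.90, cost_usd=0.010, latency_s=1.0)
window = [Sample(0.80, 0.010, 1.0), Sample(0.82, 0.010, 1.1)]
alerts = drift_alerts(baseline, window)
```

A non-empty `alerts` list is the signal to start the workflow again.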

Principles

The private leaderboard is the thing you keep. Public benchmarks only make the shortlist.
Hard rules are filters, not preferences. Apply them first to shrink the candidate list.
The pick is not one and done. Watching the model in production is part of the work.
Score every model on the same guide and the same test set, or the comparison is just for show.
