Top-Tier Model For Everything (Cost)
also known as Always-Use-The-Best-Model, Frontier-Model Default
Anti-pattern: route every request through the highest-tier model regardless of difficulty, treating cost as a model-choice problem instead of a routing one.
Context
A team picks the strongest available model (Opus, GPT-5.x) during prototyping for maximum quality. The wrapper defaults are kept in production. Every classification, every extraction, every summarization, every routine reply goes through the most expensive model the team can buy.
Problem
Cost grows 5–20× compared to a tiered system, with no measurable quality benefit on the easy 80–90% of traffic. The team only notices when the bill arrives. Rationalizations like 'quality matters' or 'simpler to have one model' justify it post-hoc. When budget pressure forces a fix, the team has no telemetry on per-request difficulty and cannot route safely.
Forces
- Top-tier models are obviously fine for everything; weaker models are not obviously fine.
- Telemetry to measure per-request difficulty does not exist by default; the team has to build it.
- 'Quality matters' is hard to argue against without numbers.
Example
A SaaS app classifies user feedback (positive/negative/feature-request) using GPT-5 with full reasoning. 300k calls per month. Bill: $24k. A simple fine-tuned classifier hits 96% accuracy at $20/month. The team only fixes it after a board cost-review forces the question.
Diagram
Solution
Therefore:
Build a routing layer that classifies each request by difficulty (heuristic, classifier, or fast model judgement) and routes to the smallest model that handles its class well. Reserve the top tier for requests escalated by low confidence, high stakes, or explicit user choice. Pair with complexity-based-routing and multi-model-routing. Track cost-per-request as a first-class metric.
What this pattern forbids. No useful constraint; the missing constraint is per-request difficulty-based routing.
And the patterns that stand alongside it, or against it —
- alternative-toComplexity-Based Routing★— Estimate a request's difficulty up front and bind it to the cheapest model tier that can answer well, using an explicit complexity classifier as the routing key.
- alternative-toMulti-Model Routing★★— Send each request to the cheapest model that can handle it well.
- complementsOpen-Weight Cascade★— Build a multi-model cascade where lower tiers are open-weight, self-hostable models that run inside the operator's boundary, and only escalations cross to a hosted frontier model — giving cost arbitrage *and* sovereignty.
- complementsMixture of Experts Routing★— Route each request to one or more domain-expert agents, where each expert holds deep capability in a narrow area.
- complementsCost Observability★★— Surface per-request, per-user, and per-feature cost and token consumption to operators in near-real-time.
- complementsRealtime API When Batchable✕— Anti-pattern: use the realtime/synchronous model API for workloads whose latency budget would permit batching, paying 2–10× the unit cost for no user-visible benefit.
Neighbourhood
Click any neighbour to follow the language. Scroll to zoom, drag to pan.