Multi-Model Routing
also known as Cascade Routing, Cheap-First Routing, Model Cascading
Send each request to the cheapest model that can handle it well.
This pattern helps complete certain larger patterns —
- specialises Routing ★★ — Classify an incoming request and dispatch it to the specialist (lane / agent / model) best suited to handle it.
- used-by Dual-System GUI Agent ★ — Split a GUI agent into a decision model that plans and recovers from errors and a grounding model that observes pixels and emits the precise action; route each subproblem to the better-suited model.
- used-by Degenerate-Output Detection ★ — Detect when the agent is about to emit a near-duplicate of its own recent output and either drop, replace, or escalate to a stronger model rather than ship the loop.
Context
A team is building a production agent and has access to several language models from one or more providers, typically a small cheap model, a mid-tier model, and a frontier model whose per-token price is an order of magnitude higher. The traffic is mixed: many requests are simple extractions, classifications, or rephrasings, while a smaller share genuinely needs the frontier model's depth. The team has to decide which model handles each kind of request.
Problem
If every request is routed to the frontier model, the bill is far larger than it needs to be, because the cheap model would have handled most of the traffic at the same quality. If every request is routed to the cheap model, the hard cases come back wrong with no signal that a better model was available. A static single-model choice forces a bad compromise. Naive escalation is not a free fix either: a request that tries the cheap model first and then falls back to the strong one pays for both calls, so when fallbacks are frequent the cascade can cost more than starting with the strong model.
Forces
- Quality bar must be measurable per request type.
- Cheap models hallucinate confidently; the router must not trust them blindly.
- Falling back from cheap to expensive on failure pays for both calls, so a failed cheap attempt costs more than starting expensive.
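The third force can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a flat per-request cost for each tier and a known fallback rate; the names `cascade_cost` and `break_even_fallback_rate` and the prices used are illustrative assumptions, not figures from this pattern. A cascade pays for the cheap call every time and for the strong call only when it falls back.

```python
def cascade_cost(cheap_cost: float, strong_cost: float, p_fallback: float) -> float:
    """Expected per-request cost of try-cheap-then-fall-back-to-strong."""
    if not 0.0 <= p_fallback <= 1.0:
        raise ValueError("p_fallback must be in [0, 1]")
    # The cheap call is always paid; the strong call only on fallback.
    return cheap_cost + p_fallback * strong_cost


def break_even_fallback_rate(cheap_cost: float, strong_cost: float) -> float:
    """Fallback rate above which the cascade costs more than going strong directly.

    Derived from: cheap_cost + p * strong_cost < strong_cost.
    """
    if strong_cost <= 0:
        raise ValueError("strong_cost must be positive")
    return 1.0 - cheap_cost / strong_cost


# Illustrative prices only: the strong tier costs ten times the cheap tier.
cheap, strong = 1.0, 10.0
expected = cascade_cost(cheap, strong, p_fallback=0.2)
threshold = break_even_fallback_rate(cheap, strong)
print(f"expected cascade cost: {expected:.2f} vs direct strong: {strong:.2f}")
print(f"cascade stops paying off above a fallback rate of {threshold:.0%}")
```

Under these assumed prices the cascade is cheaper on average as long as fewer than nine in ten requests fall back, but any individual request that does fall back costs the sum of both calls. That is why the quality bar and the fallback rate need to be measured per request type rather than guessed.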
Example
A SaaS company is paying frontier-model prices for every request, including 'what's the weather in Berlin' and 'extract emails from this paragraph'. The team adds Multi-Model Routing: a tiny classifier sends simple extractions and routing decisions to a cheap small model and reserves the expensive frontier model for the screen-aware dialog and final answers. A confidence cascade falls back to the strong model when the cheap one returns a low-confidence answer. Total token cost drops by 60 percent with no measurable quality loss on the eval set.
Solution
Therefore:
Combine routing (classify the request) with a per-class model preference. Routing and filter extraction go to the cheap model; the screen-aware dialog or final answer goes to the strong model. Optionally cascade: try cheap, fall back to strong if confidence is low.
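The solution can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the `Router` class, the tier names, the `min_confidence` threshold of 0.7, and the stand-in classifier and models are all hypothetical. In particular, it assumes each model call returns a confidence score alongside its answer, which real providers may not expose directly.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Assumption: a model is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


@dataclass
class Router:
    classify: Callable[[str], str]   # request -> class label
    preference: Dict[str, str]       # class label -> preferred tier
    models: Dict[str, Model]         # tier name -> model callable
    strong_tier: str = "strong"
    min_confidence: float = 0.7      # assumed threshold; tune on an eval set

    def handle(self, request: str) -> Tuple[str, str]:
        """Return (answer, name of the tier that produced it)."""
        label = self.classify(request)
        # Unknown classes go to the strong tier rather than guessing cheap.
        tier = self.preference.get(label, self.strong_tier)
        answer, confidence = self.models[tier](request)
        # Optional cascade: escalate only when a cheaper tier is unsure.
        if tier != self.strong_tier and confidence < self.min_confidence:
            tier = self.strong_tier
            answer, _ = self.models[tier](request)
        return answer, tier


# Hypothetical stand-ins for a real classifier and real model clients.
def toy_classifier(request: str) -> str:
    return "extraction" if request.lower().startswith("extract") else "dialog"


def cheap_model(request: str) -> Tuple[str, float]:
    # Pretends to be unsure about long inputs.
    confidence = 0.9 if len(request) < 40 else 0.3
    return f"cheap:{request}", confidence


def strong_model(request: str) -> Tuple[str, float]:
    return f"strong:{request}", 0.95


router = Router(
    classify=toy_classifier,
    preference={"extraction": "cheap", "dialog": "strong"},
    models={"cheap": cheap_model, "strong": strong_model},
)

for text in ("extract emails", "explain this screen",
             "extract every email address from this very long paragraph"):
    answer, tier = router.handle(text)
    print(f"{tier:6s} <- {text}")
```

Returning the tier together with the answer keeps the model choice visible to callers and logs, and sending unknown classes to the strong tier trades some cost for safety when the classifier meets a request type it was not built for.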
What this pattern forbids. Each request class is bound to a model tier; agents cannot escalate without routing approval.
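One way to enforce this constraint is to make the routing layer the only place that can grant a higher tier. The sketch below is an assumption about how such a check might look; `TierPolicy`, `EscalationDenied`, and the example classes and tiers are hypothetical names introduced for illustration.

```python
from typing import Dict, Set


class EscalationDenied(Exception):
    """Raised when a request class asks for a tier it may not reach."""


class TierPolicy:
    def __init__(self, bound_tier: Dict[str, str],
                 may_escalate_to: Dict[str, Set[str]]) -> None:
        self.bound_tier = bound_tier            # class -> default tier
        self.may_escalate_to = may_escalate_to  # class -> tiers it may reach

    def tier_for(self, request_class: str) -> str:
        """Tier a request class is bound to by default."""
        return self.bound_tier[request_class]

    def approve_escalation(self, request_class: str, target_tier: str) -> str:
        """Return the target tier if routing allows it, else raise."""
        allowed = self.may_escalate_to.get(request_class, set())
        if target_tier not in allowed:
            raise EscalationDenied(
                f"{request_class!r} may not escalate to {target_tier!r}"
            )
        return target_tier


# Hypothetical policy: extractions may escalate, classifications may not.
policy = TierPolicy(
    bound_tier={"extraction": "cheap", "classification": "cheap"},
    may_escalate_to={"extraction": {"strong"}},
)

print(policy.approve_escalation("extraction", "strong"))
try:
    policy.approve_escalation("classification", "strong")
except EscalationDenied as err:
    print("denied:", err)
```

An agent that wants a stronger model calls `approve_escalation` instead of choosing a model itself, so every escalation is an explicit, auditable routing decision.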
The smaller patterns that complete this one —
- generalises Open-Weight Cascade ★ — Build a multi-model cascade where lower tiers are open-weight, self-hostable models that run inside the operator's boundary, and only escalations cross to a hosted frontier model — giving cost arbitrage *and* sovereignty.
- generalises Complexity-Based Routing ★ — Estimate a request's difficulty up front and bind it to the cheapest model tier that can answer well, using an explicit complexity classifier as the routing key.
And the patterns that stand alongside it, or against it —
- complements Cost Gating ★★ — Block actions whose expected cost exceeds a threshold without explicit user (or operator) acknowledgement.
- complements Fallback Chain ★★ — Try a primary handler; on failure or low confidence, fall through to a sequence of fallback handlers.
- alternative-to Hero Agent ✕ — Anti-pattern: stuff every capability into one agent with one giant prompt.
- complements Provider Fallback ★★ — When one provider's API errors mid-stream, transparently switch to another provider while preserving state.
- alternative-to Hidden Mode Switching ✕ — Anti-pattern: silently swap the underlying model between requests without disclosing the change to users or operators.
- complements Multilingual Voice Agent Stack ★ — Compose a voice agent as a tightly co-located pipeline of speech-to-text, language-aware LLM reasoning, and text-to-speech, where one vendor owns all three so language and dialect propagate cleanly across stages.
- alternative-to RL-Trained Conductor Orchestrator · — Train a small meta-model with reinforcement learning to dispatch sub-tasks across a pool of frontier LLM workers, learning the communication topology end-to-end and allowing the conductor to recursively invoke itself as a worker.
- complements Provider-String Routing ★ — Select the model and provider for a request through a single namespaced string (`provider/model`) backed by env-var credentials, so the caller specifies what to run with one parameter rather than a typed provider object.
- alternative-to Vendor Lock-In ✕ — Anti-pattern: couple application code directly to one model provider's SDK, request shape, and proprietary features so that switching providers requires rewriting application code rather than swapping an adapter.
- complements Adaptive Compute Allocation ★ — Allocate inference-time compute (thinking tokens, samples, depth, model size) per query based on input difficulty, rather than using a fixed budget across all queries.
- complements Hybrid Symbolic-Neural Routing ★ — Per query, route between a symbolic path (rule engine, knowledge graph) and a neural path (LLM), using the LLM for interpretation and the symbolic layer for exact constraints.
- alternative-to Hierarchical Retrieval ★★ — Route a query through a multi-level cascade — coarse source or index selection, then per-source narrower retrieval, then chunk-level — so each retrieval decision is pushed to the cheapest tier that can answer it.
- alternative-to Top-Tier Model For Everything (Cost) ✕ — Anti-pattern: route every request through the highest-tier model regardless of difficulty, treating cost as a model-choice problem instead of a routing one.
- complements Large Action Models (LAMs) · — Use a model class specifically trained for action execution (tool calls, UI navigation, workflow steps) rather than text generation, when the workload is dominated by reliably completing actions in real systems.
- complements MRKL Systems (Modular Neuro-Symbolic) ★★ — Route each request through an LLM dispatcher to specialized symbolic or neural expert modules (calculator, knowledge base, code executor) rather than asking one LLM to do everything; integrate the modules' results for the final response.
- complements Large Reasoning Model (LRM) Paradigm ★ — Route reasoning-heavy tasks to a reasoning-tuned model that trades inference time for deliberation, rather than to a fast LLM that exhibits premature-closure.