Open-Weight Cascade
also known as Permissive-License Cascade, Sovereign Routing, Self-Hostable Cascade
Build a multi-model cascade where lower tiers are open-weight, self-hostable models that run inside the operator's boundary, and only escalations cross to a hosted frontier model — giving cost arbitrage *and* sovereignty.
This pattern helps complete certain larger patterns:
- specialises Multi-Model Routing ★★: Send each request to the cheapest model that can handle it well.
Context
An operator in a regulated environment — a European bank, a healthcare provider, a government agency — is building an agent and wants both the cost benefits of a multi-tier model cascade and the assurance that sensitive data does not leave their controlled boundary. Open-weight models that can be self-hosted have become capable enough to handle most requests at low cost, but a small share of hard requests still benefit from a hosted frontier model. The operator already runs at least one open-weight model on infrastructure they control.
Problem
A simple cheap-first cascade routes easy requests to an open-weight model and hard ones to a hosted frontier model, so any sensitive request that happens to be hard or borderline silently leaves the regulated boundary. An open-weight-only cascade keeps everything in-house but loses capability on the rare hard request that genuinely needs the frontier model. Neither extreme suits an operator who needs cost arbitrage on insensitive traffic and strict in-boundary processing on sensitive traffic.
Forces
- Most requests are easy; cheap models handle them.
- Hard requests need frontier capability.
- Some requests must never leave the boundary regardless of difficulty.
- Open-weight models close the capability gap with frontier models, but with a lag.
Example
A European bank wants the cost-and-quality benefits of a multi-tier model cascade but is bound by data-residency rules that forbid sending customer queries to a hosted US frontier model. The team builds an Open-Weight Cascade: requests are first stratified by sensitivity, sensitive ones are pinned to the on-prem open-weight tier (where they degrade or are refused rather than escalated), and only insensitive hard requests are allowed to escalate to the hosted frontier model. The bank gets the cost arbitrage without violating residency for sensitive traffic.
Solution
Therefore:
Stratify requests by sensitivity *and* difficulty before routing. (1) Sensitive requests: forced down the open-weight path even if confidence is low; degrade gracefully or refuse rather than escalate. (2) Insensitive easy requests: small open-weight model. (3) Insensitive hard requests: escalate to hosted frontier model. The router enforces the sensitivity classification before any model call.
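The stratification above can be sketched as a small routing function. This is a minimal illustration, not a reference implementation: `Tier`, `Route`, `route`, `is_sensitive`, and `is_hard` are hypothetical names, and both classifiers are assumed to run inside the operator's boundary. Treating an unclassifiable request as sensitive (failing closed) is an added assumption, not something the pattern text prescribes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Tier(Enum):
    OPEN_WEIGHT = "open-weight"          # self-hosted, inside the boundary
    HOSTED_FRONTIER = "hosted-frontier"  # third-party API, outside the boundary


@dataclass(frozen=True)
class Route:
    tier: Tier
    # False means a low-confidence answer must degrade or be refused
    # in-boundary; it may never be escalated to the hosted tier.
    may_leave_boundary: bool


def route(
    text: str,
    is_sensitive: Callable[[str], Optional[bool]],  # None = could not classify
    is_hard: Callable[[str], bool],
) -> Route:
    # Sensitivity is decided first, before any model is called.
    sensitive = is_sensitive(text)
    if sensitive is None or sensitive:
        # (1) Sensitive, or unknown (assumption: fail closed).
        return Route(Tier.OPEN_WEIGHT, may_leave_boundary=False)
    if is_hard(text):
        # (3) Insensitive and hard: go to the hosted frontier model.
        return Route(Tier.HOSTED_FRONTIER, may_leave_boundary=True)
    # (2) Insensitive and easy: small open-weight model; a later
    # low-confidence result is still allowed to escalate.
    return Route(Tier.OPEN_WEIGHT, may_leave_boundary=True)
```

Keeping difficulty out of the sensitive branch entirely means a hard sensitive request can never reach the hosted tier through a misjudged difficulty score.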
What this pattern forbids. A request classified as sensitive may not be routed to a hosted frontier model; the hosted tier is only reachable from the insensitive path.
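One way to make this prohibition hard to bypass is to check it again at the single function that can reach the hosted model, so that a routing bug upstream fails loudly instead of leaking data. The sketch below assumes hypothetical names (`ClassifiedRequest`, `BoundaryViolation`, `call_hosted_frontier`) and a `client` callable standing in for whatever hosted API wrapper the operator uses; it illustrates the idea rather than any particular vendor SDK.

```python
from dataclasses import dataclass
from typing import Callable


class BoundaryViolation(Exception):
    """Raised when a sensitive request is about to leave the boundary."""


@dataclass(frozen=True)
class ClassifiedRequest:
    text: str
    sensitive: bool  # set by the in-boundary sensitivity classifier


def call_hosted_frontier(
    req: ClassifiedRequest,
    client: Callable[[str], str],  # hypothetical wrapper around the hosted API
) -> str:
    # Egress guard: the hosted tier is only reachable from the insensitive path.
    if req.sensitive:
        raise BoundaryViolation("sensitive request may not be sent to the hosted tier")
    return client(req.text)
```

Because the guard takes an already classified request, code that has skipped classification cannot call the hosted tier without constructing a `ClassifiedRequest` explicitly.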
The smaller patterns that complete this one:
- uses Fallback Chain ★★: Try a primary handler; on failure or low confidence, fall through to a sequence of fallback handlers.
And the patterns that stand alongside it, or against it:
- complements Sovereign Inference Stack ★: Run the entire agent stack (model weights, inference, tool layer, vector stores, logs) inside a jurisdictional and operational boundary the operator controls, so no request, prompt, or output crosses into a third-party API.
- complements PII Redaction ★★: Detect and remove personally identifiable information from inputs to and outputs from the model.
- complements Provider Fallback ★★: When one provider's API errors mid-stream, transparently switch to another provider while preserving state.
- complements Agentic Supply Chain Compromise ✕: Anti-pattern: compose agent capabilities at runtime from third-party tools, RAG sources, model providers, plugin marketplaces, and tool definitions, with no integrity check on what loaded.
- alternative-to Complexity-Based Routing ★: Estimate a request's difficulty up front and bind it to the cheapest model tier that can answer well, using an explicit complexity classifier as the routing key.
- complements Top-Tier Model For Everything (Cost) ✕: Anti-pattern: route every request through the highest-tier model regardless of difficulty, treating cost as a model-choice problem instead of a routing one.