Adaptive Compute Allocation

also known as Input-Adaptive Thinking Budget, Per-Query Compute Routing, Adaptive Thinking

Allocate inference-time compute (thinking tokens, samples, depth, model size) per query based on input difficulty, rather than using a fixed budget across all queries.

This pattern helps complete certain larger patterns —

specialises Test-Time Compute Scaling ★★: Allocate more inference-time compute (samples, search, deeper thinking) instead of scaling parameters to improve quality.

Context

A reasoning agent or inference router serves queries of widely varying difficulty: simple lookups, moderate multi-step reasoning, hard novel problems. Compute per query is the dominant cost. The trivial policy — fixed budget across all queries — either wastes compute on simple ones or under-serves hard ones.

Problem

Static compute budgets force a single trade-off across all queries. With LLM inference cost dominating production economics, the slack on simple queries is large; the deficit on hard queries is real. Recent work (the 2025 arXiv survey 'Reasoning on a Budget', the 2026 ACM Web Conference paper on adaptive routing) shows that input-conditional allocation can reduce cost without sacrificing quality — but only if there is a reliable signal for per-query difficulty available before commitment.

Forces

Compute is expensive: over-allocation wastes it, and under-allocation produces wrong answers.
Per-query difficulty is not always knowable upfront; some signals (self-consistency, model-uncertainty) require partial generation to read.
Routing-quality and routing-overhead trade off — a complex router can eat the savings.
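The second force mentions self-consistency as a signal that can only be read after partial generation. A minimal sketch of one way to compute it, assuming the deployment can draw a few cheap sampled answers per query: the function names, the string normalisation, and the 0.6 threshold are illustrative assumptions, not part of the pattern.

```python
from collections import Counter


def self_consistency(answers: list[str]) -> float:
    """Fraction of sampled answers that agree with the most common one."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Naive normalisation (assumption): real systems usually compare
    # extracted final answers, not raw strings.
    normalised = [a.strip().lower() for a in answers]
    top_count = Counter(normalised).most_common(1)[0][1]
    return top_count / len(normalised)


def needs_more_compute(answers: list[str], threshold: float = 0.6) -> bool:
    """Hypothetical escalation rule: low agreement suggests a harder query."""
    return self_consistency(answers) < threshold
```

Each extra sample is itself compute, so the number of samples drawn for this signal counts towards the routing overhead named in the third force.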

Example

A customer-support assistant serves 10M queries/month. Profiling shows ~70% are FAQ-style (one-shot), ~25% are multi-step (need plan+execute), and ~5% are genuinely novel (need extended thinking). The current setup uses a fixed extended-thinking budget on every query. The team adds a difficulty estimator: a small classifier scores prompt complexity and routes the 70% to a small fast path with no thinking tokens, the 25% to a moderate budget, and the 5% to the full extended-thinking budget. Net inference cost drops 60% with no quality regression on production traffic.
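The arithmetic behind a saving of this kind can be sketched as a blended cost over the traffic mix. The mix below comes from the example; the per-tier relative costs and the router overhead are hypothetical placeholders (the source does not give them), so the result illustrates the calculation and does not reproduce the 60% figure.

```python
def blended_cost(
    mix: dict[str, float],
    tier_cost: dict[str, float],
    router_cost: float = 0.0,
) -> float:
    """Average cost per query, relative to a fixed full-budget baseline of 1.0."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    # Every query pays the router overhead, then the cost of its own tier.
    return router_cost + sum(share * tier_cost[tier] for tier, share in mix.items())


# Traffic mix taken from the example above.
mix = {"easy": 0.70, "moderate": 0.25, "hard": 0.05}
# Hypothetical relative costs: the full extended-thinking path is 1.0.
tier_cost = {"easy": 0.10, "moderate": 0.50, "hard": 1.00}
# Hypothetical classifier overhead, paid on every query.
router_cost = 0.02

cost = blended_cost(mix, tier_cost, router_cost)
savings = 1.0 - cost
print(f"blended cost {cost:.3f}, savings {savings:.1%}")
```

Under these assumed numbers the blended cost is about 0.265 of the baseline, a saving of about 73.5%. Raising `router_cost` shows how quickly a heavier router erodes that margin.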

Diagram

flowchart TD
    Q[Incoming query] --> D[Cheap difficulty estimator]
    D -- easy --> S[Small budget / fast model]
    D -- moderate --> M[Moderate budget / extended thinking]
    D -- hard --> L[Full budget / large model]
    S --> Ans1[Answer]
    M --> C{Confidence ok?}
    L --> Ans2[Answer]
    C -- yes --> Ans3[Answer]
    C -- no --> L

Solution

Therefore:

Adopt a per-query budget pipeline: a cheap difficulty estimator picks the initial budget; partial-output signals (low self-consistency, low model confidence, branching mid-reasoning) trigger a budget ramp; a hard ceiling on budget per query prevents runaway spend. Variants include model routing (small model first, escalate on uncertainty), thinking-token budget control, and sample-count adaptation. The pattern is distinct from test-time-compute-scaling in being explicitly input-conditional.
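The pipeline described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions: `estimate_difficulty` and `generate` are hypothetical callables supplied by the deployment, the three budget tiers and the 0.7 confidence threshold are placeholder values, and the ceiling is interpreted here as a cap on total thinking tokens budgeted across all attempts for one query.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Attempt:
    answer: str
    confidence: float  # 0..1, however the deployment chooses to measure it


def answer_with_adaptive_budget(
    query: str,
    estimate_difficulty: Callable[[str], float],  # hypothetical: score in [0, 1]
    generate: Callable[[str, int], Attempt],      # hypothetical: (query, budget)
    budgets: tuple[int, ...] = (0, 2048, 16384),  # placeholder tiers
    min_confidence: float = 0.7,                  # placeholder threshold
    ceiling: int = 16384,                         # hard cap per query
) -> tuple[Attempt, int]:
    """Return the final attempt and the total thinking tokens budgeted."""
    score = estimate_difficulty(query)
    # Map the difficulty score onto a starting tier, clamped to a valid index.
    tier = max(0, min(int(score * len(budgets)), len(budgets) - 1))
    spent = 0
    while True:
        # Never budget more than what remains under the ceiling.
        budget = min(budgets[tier], ceiling - spent)
        attempt = generate(query, budget)
        spent += budget
        last_tier = tier == len(budgets) - 1
        if attempt.confidence >= min_confidence or last_tier or spent >= ceiling:
            return attempt, spent
        tier += 1  # budget ramp: escalate one tier on low confidence
```

Swapping `generate` for a function that selects a model per tier gives the model-routing variant; using the tier to set a sample count gives sample-count adaptation.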

What this pattern forbids. Committing to a compute level before a per-query difficulty estimate has been made, and applying one flat compute budget across the deployment; budgets must be elastic per query.

And the patterns that stand alongside it, or against it —

complements Sleep-Time Compute ·: During idle or downtime, run the model offline against the user's standing context to pre-compute dense summaries and likely future answers, so test-time latency and cost drop when the user actually asks.
complements Mode-Adaptive Cadence ★: Vary the agent's loop interval based on current salience so the agent thinks faster when something is happening and slower when nothing is, instead of running on a fixed cron.
complements Multi-Model Routing ★★: Send each request to the cheapest model that can handle it well.
complements Process Reward Model ★: Train a verifier that scores each reasoning step rather than only the final answer.
complements Complexity-Based Routing ★: Estimate a request's difficulty up front and bind it to the cheapest model tier that can answer well, using an explicit complexity classifier as the routing key.

Used in frameworks

Vertex AI Agent Builder
first-class · 13 patterns · Enterprise Platforms · ★★ mature
Gemini's thinkingBudget parameter set to -1 enables dynamic thinking, where the model itself adjusts the number of thinking tokens spent based on the difficulty/complexity of each…

Provenance

Source: patterns/adaptive-compute-allocation.md on GitHub · commit 159e600
Added to catalog: 2026-05-21
Last updated: 2026-05-21
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.