Adaptive Compute Allocation
Allocate inference-time compute (thinking tokens, samples, depth, model size) per query based on input difficulty, rather than using a fixed budget across all queries.
Problem
Static compute budgets force a single trade-off across all queries: simple queries receive more compute than they need, while hard queries receive too little. Because LLM inference cost dominates production economics, both the slack and the deficit matter. Recent work (the 2025 arXiv survey 'Reasoning on a Budget', the 2026 ACM Web Conference paper on adaptive routing) shows that input-conditional allocation can reduce cost without sacrificing quality, but only if a reliable per-query difficulty signal is available before compute is committed.
Solution
Adopt a per-query budget pipeline with three stages. First, a cheap difficulty estimator picks the initial budget. Second, partial-output signals (low self-consistency, low model confidence, branching mid-reasoning) trigger a budget ramp. Third, a hard per-query ceiling on the budget prevents runaway spending. Variants include model routing (small model first, escalate on uncertainty), thinking-token budget control, and sample-count adaptation. The pattern differs from test-time-compute-scaling in being explicitly input-conditional.
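The three stages can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the word-count difficulty heuristic, the `StepResult` type, the `generate` callable, and all default thresholds are hypothetical stand-ins for whatever estimator, model call, and confidence signal a real deployment provides.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class StepResult:
    answer: str
    confidence: float  # assumed to lie in [0, 1]; how it is derived is deployment-specific


def estimate_difficulty(query: str) -> float:
    """Hypothetical cheap estimator: longer queries score as harder, capped at 1.0."""
    return min(len(query.split()) / 50.0, 1.0)


def initial_budget(difficulty: float, floor: int = 256, ceiling: int = 4096) -> int:
    """Map a difficulty score in [0, 1] linearly onto a token budget."""
    return int(floor + difficulty * (ceiling - floor))


def run_with_adaptive_budget(
    query: str,
    generate: Callable[[str, int], StepResult],  # hypothetical model call: (query, budget) -> result
    *,
    ceiling: int = 4096,
    min_confidence: float = 0.8,
    ramp_factor: float = 2.0,
) -> Tuple[StepResult, int]:
    """Return the accepted result and the budget it was produced with."""
    # Stage 1: the cheap estimator picks the initial budget, clamped to the ceiling.
    budget = min(initial_budget(estimate_difficulty(query), ceiling=ceiling), ceiling)
    while True:
        result = generate(query, budget)
        # Stage 3: the hard ceiling ends the loop even if confidence stays low.
        if result.confidence >= min_confidence or budget >= ceiling:
            return result, budget
        # Stage 2: a low-confidence signal triggers a budget ramp.
        budget = min(int(budget * ramp_factor), ceiling)
```

In this sketch the ceiling applies to each attempt's budget, not to the cumulative spend across retries. A deployment that bills for every attempt may prefer to track total tokens and stop on that instead.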
When to use
- Production deployments where mean-query cost dominates and query difficulty varies widely.
- Reasoning agents with extended-thinking / sample-count controls available.
- Multi-model setups where smaller and larger models can be selected per query.
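For the multi-model case, the model-routing variant (small model first, escalate on uncertainty) might look like the sketch below. The `Model` signature, the idea that each model reports its own confidence, and the `min_confidence` default are assumptions made for illustration only.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical model interface: takes a query, returns (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def route(
    query: str,
    models: Sequence[Model],  # ordered cheapest first
    min_confidence: float = 0.8,
) -> Tuple[str, int]:
    """Try models in cost order; return the answer and the index of the tier that produced it."""
    if not models:
        raise ValueError("at least one model is required")
    answer = ""
    for tier, model in enumerate(models):
        answer, confidence = model(query)
        if confidence >= min_confidence:
            return answer, tier  # confident enough: stop escalating
    # No tier was confident: fall back to the largest model's answer.
    return answer, len(models) - 1
```

The list length acts as the hard ceiling here: escalation stops at the last tier regardless of confidence.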