Reasoning

Adaptive Compute Allocation

Allocate inference-time compute (thinking tokens, samples, depth, model size) per query based on input difficulty, rather than using a fixed budget across all queries.

Problem

A static compute budget forces one cost/quality trade-off onto every query: simple queries receive more compute than they need, and hard queries receive less. Where LLM inference cost dominates production economics, both the waste and the shortfall matter. Recent work (the 2025 arXiv survey 'Reasoning on a Budget', the 2026 ACM Web Conference paper on adaptive routing) shows that input-conditional allocation can reduce cost without sacrificing quality, but only if a reliable per-query difficulty signal is available before compute is committed.

Solution

Adopt a per-query budget pipeline with three parts: a cheap difficulty estimator picks the initial budget; partial-output signals (low self-consistency, low model confidence, branching mid-reasoning) trigger a budget ramp; and a hard per-query ceiling prevents runaway spend. Variants include model routing (try a small model first, escalate on uncertainty), thinking-token budget control, and sample-count adaptation. The pattern differs from test-time-compute-scaling in being explicitly input-conditional.
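The three parts of the pipeline (initial budget from a difficulty estimate, ramp on low confidence, hard ceiling) can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `BudgetPolicy`, `initial_budget`, `run_with_adaptive_budget`, the toy estimator and solver, and all numeric defaults are hypothetical names and values chosen for the example. A single confidence score stands in for the partial-output signals, and each escalation is modelled as a fresh attempt whose tokens count against the per-query ceiling.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class BudgetPolicy:
    # All defaults are illustrative assumptions, not recommended values.
    base_tokens: int = 256          # starting budget for the easiest queries
    max_tokens: int = 4096          # hard ceiling on total tokens per query
    ramp_factor: float = 2.0        # budget multiplier on each escalation
    confidence_floor: float = 0.7   # escalate while confidence is below this

    def __post_init__(self) -> None:
        if self.base_tokens < 1 or self.max_tokens < self.base_tokens:
            raise ValueError("need 1 <= base_tokens <= max_tokens")
        if self.ramp_factor < 2.0:
            raise ValueError("ramp_factor must be at least 2")


@dataclass
class Result:
    answer: str
    confidence: float
    spent_tokens: int
    attempts: int


def initial_budget(difficulty: float, policy: BudgetPolicy) -> int:
    """Linearly map a difficulty score in [0, 1] to a starting budget."""
    d = min(max(difficulty, 0.0), 1.0)
    span = policy.max_tokens - policy.base_tokens
    return policy.base_tokens + round(d * span)


def run_with_adaptive_budget(
    query: str,
    estimate_difficulty: Callable[[str], float],
    solve: Callable[[str, int], Tuple[str, float]],
    policy: BudgetPolicy,
) -> Result:
    """Spend more compute only while confidence is low and budget remains."""
    budget = initial_budget(estimate_difficulty(query), policy)
    spent, attempts = 0, 0
    while True:
        answer, confidence = solve(query, budget)
        spent += budget
        attempts += 1
        remaining = policy.max_tokens - spent
        # Stop when the answer is confident enough or the ceiling is reached.
        if confidence >= policy.confidence_floor or remaining <= 0:
            return Result(answer, confidence, spent, attempts)
        # Ramp the budget, but never past what the ceiling still allows.
        budget = min(int(budget * policy.ramp_factor), remaining)


def toy_difficulty(query: str) -> float:
    # Hypothetical heuristic: treat longer queries as harder.
    return min(len(query.split()) / 20.0, 1.0)


def toy_solve(query: str, budget: int) -> Tuple[str, float]:
    # Stand-in for a model call: confidence grows with budget relative
    # to a made-up per-query requirement.
    needed = 100 * len(query.split())
    return f"answer using {budget} tokens", min(budget / needed, 1.0)


if __name__ == "__main__":
    policy = BudgetPolicy()
    for q in ["What is 2 + 2 ?", "Prove the claim " + "step " * 30]:
        print(run_with_adaptive_budget(q, toy_difficulty, toy_solve, policy))
```

In a real deployment `estimate_difficulty` and `solve` would wrap a classifier and a model call. The same loop covers the routing variant if `solve` chooses a model by budget tier instead of a token limit.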

When to use

  • Production deployments where mean per-query cost dominates and query difficulty varies widely.
  • Reasoning agents with extended-thinking / sample-count controls available.
  • Multi-model setups where smaller and larger models can be selected per query.

