Test-Time Compute Scaling
also known as Inference-Time Scaling, Compute-Time Trade-Off
Allocate more inference-time compute (samples, search, deeper thinking) instead of scaling parameters to improve quality.
Context
A team is at a quality ceiling on a hard workload — math benchmarks, code reasoning, complex planning — and the obvious move of waiting for the next generation of a larger model is either unavailable or too expensive. They have inference budget they could spend, and they have noticed that some classes of problem respond well to spending more compute at answer-time rather than at training-time.
Problem
A single-pass call to even a strong model under-uses the compute available at inference time. The team knows several inference-time techniques exist (drawing many samples and picking the best, voting across many samples, searching over reasoning trees, allocating more internal reasoning tokens), but each technique shines on a different kind of task. Without a deliberate policy for how to spend inference budget per task class, the team leaves easy quality gains on the table and overspends on items that would not have benefited.
Forces
- Wall-clock latency rises with compute.
- Cost rises linearly or worse with sample count.
- The best technique (sampling, search, or deeper thinking) is task-dependent.
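The first two forces can be made concrete with a back-of-the-envelope model. The sketch below is illustrative only: the `SamplingCost` class, its field names, and the numbers are assumptions rather than measurements, and it covers only the linear case (search-based techniques can grow faster).

```python
from dataclasses import dataclass


@dataclass
class SamplingCost:
    """Hypothetical cost profile for drawing independent samples."""
    tokens_per_sample: int     # assumed average tokens generated per sample
    price_per_token: float     # made-up price in arbitrary currency units
    seconds_per_sample: float  # assumed wall-clock time of one sample

    def cost(self, n_samples: int) -> float:
        # Token cost grows linearly with the number of samples drawn.
        return n_samples * self.tokens_per_sample * self.price_per_token

    def latency(self, n_samples: int, parallelism: int = 1) -> float:
        # Samples run in waves of `parallelism`; each wave takes one sample's time.
        waves = -(-n_samples // parallelism)  # ceiling division
        return waves * self.seconds_per_sample


# Placeholder numbers, chosen only to keep the arithmetic easy to follow.
profile = SamplingCost(tokens_per_sample=1000, price_per_token=0.5, seconds_per_sample=4.0)
print(profile.cost(8))                    # spend for 8 samples
print(profile.latency(8))                 # fully sequential
print(profile.latency(8, parallelism=4))  # two waves of four
```

Under these assumptions, parallel sampling trades latency for concurrency but leaves the token bill unchanged, which is why the two forces are listed separately.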
Example
A team has a hard math benchmark where their current model underperforms; the obvious move is to wait for a larger model. Instead they apply test-time compute scaling: best-of-N sampling with a verifier for verifier-amenable items, self-consistency for sampling-amenable items, tree search for combinatorial items, extended thinking for sequential reasoning. Per-item cost rises but accuracy on the benchmark beats the next-tier model at lower total cost.
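Two of the techniques named in this example reduce to small selection rules once the samples are in hand. The following is a minimal sketch, assuming the candidate answers have already been sampled as strings; `best_of_n`, `self_consistency`, and the toy `verifier` are hypothetical names for illustration, not part of any particular library.

```python
from collections import Counter
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate the scorer ranks highest (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def self_consistency(answers: Sequence[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


# Pretend these five strings were sampled for the question "What is 6 * 7?".
samples = ["41", "42", "42", "40", "42"]


def verifier(answer: str) -> float:
    # Toy verifier: checks the arithmetic directly. A real verifier might be
    # a reward model or a test suite; that choice is outside this sketch.
    return 1.0 if int(answer) == 6 * 7 else 0.0


print(best_of_n(samples, verifier))
print(self_consistency(samples))
```

Best-of-N needs a scorer it can trust, while self-consistency needs only that correct answers recur more often than any single wrong one, which is one reason the example assigns them to different item types.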
Solution
Therefore:
Pick the inference-time technique that fits the task class: best-of-N for verifier-amenable tasks, self-consistency for sampling-amenable tasks, tree search for combinatorial tasks, extended thinking for sequential reasoning. Compose techniques where they are complementary. Tune the compute budget per task class.
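One way to hold such a policy is a lookup table keyed by task class. The sketch below assumes task classes arrive as plain string labels; the `Technique` enum, the `Plan` record, and every budget figure are placeholders invented for illustration and would need tuning against real workloads.

```python
from dataclasses import dataclass
from enum import Enum


class Technique(Enum):
    BEST_OF_N = "best-of-N"
    SELF_CONSISTENCY = "self-consistency"
    TREE_SEARCH = "tree search"
    EXTENDED_THINKING = "extended thinking"


@dataclass(frozen=True)
class Plan:
    technique: Technique
    budget_tokens: int  # placeholder value, to be tuned per task class


# Mapping taken from the pattern text; the budgets are made-up numbers.
POLICY = {
    "verifier-amenable": Plan(Technique.BEST_OF_N, 16_000),
    "sampling-amenable": Plan(Technique.SELF_CONSISTENCY, 8_000),
    "combinatorial": Plan(Technique.TREE_SEARCH, 32_000),
    "sequential-reasoning": Plan(Technique.EXTENDED_THINKING, 12_000),
}


def plan_for(task_class: str) -> Plan:
    """Look up the plan for a task class; unknown classes are rejected."""
    try:
        return POLICY[task_class]
    except KeyError:
        raise ValueError(f"no compute policy for task class {task_class!r}") from None


print(plan_for("combinatorial"))
```

Rejecting unknown task classes, rather than silently falling back to a default, is a design choice of this sketch: it forces the budget decision to be made deliberately for every class.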
What this pattern forbids. Running a request without a stated compute budget, or letting it run past that budget: each request specifies its compute budget, and over-budget requests are cut off.
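The cut-off rule can be sketched as a wrapper around a token stream. This is a minimal illustration, assuming output arrives as an iterable of tokens and that the budget is counted in output tokens; `Request` and `run_within_budget` are hypothetical names, not an existing API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Request:
    prompt: str
    budget_tokens: int  # deliberately has no default: every request must state it

    def __post_init__(self) -> None:
        if self.budget_tokens <= 0:
            raise ValueError("budget_tokens must be positive")


def run_within_budget(request: Request, stream: Iterable[str]) -> Tuple[List[str], bool]:
    """Consume tokens until the budget is spent.

    Returns the tokens kept and a flag saying whether the stream was cut off.
    """
    kept: List[str] = []
    for token in stream:
        if len(kept) >= request.budget_tokens:
            return kept, True  # more output was available, but the budget is spent
        kept.append(token)
    return kept, False


# A stand-in for model output; a real stream would come from the model.
fake_stream = ["step1", "step2", "step3", "step4", "step5"]
tokens, cut_off = run_within_budget(Request("solve it", budget_tokens=3), fake_stream)
print(tokens, cut_off)
```

Returning the cut-off flag alongside the partial output lets the caller distinguish an answer that finished within budget from one that was truncated.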
The smaller patterns that complete this one —
- generalises Extended Thinking ★★: Spend a configurable budget of internal reasoning tokens before producing a user-visible answer.
- generalises Best-of-N Sampling ★: Sample N candidate outputs and select the highest-ranked by a reward model or scorer.
- generalises Self-Consistency ★★: Sample the same question multiple times at non-zero temperature and aggregate by majority or judge to mitigate hallucination.
- generalises Language Agent Tree Search ·: Lift the agent loop into a search tree with a learned value function and backtracking.
- generalises Process Reward Model ★: Train a verifier that scores each reasoning step rather than only the final answer.
- generalises Adaptive Branching Tree Search ·: At each node of an inference-time search tree, use Thompson sampling to decide whether to deepen an existing answer or branch a fresh attempt, optionally choosing per-node which underlying LLM to invoke.
- generalises Adaptive Compute Allocation ★: Allocate inference-time compute (thinking tokens, samples, depth, model size) per query based on input difficulty, rather than using a fixed budget across all queries.
And the patterns that stand alongside it, or against it —
- alternative-to Sleep-Time Compute ·: During idle or downtime, run the model offline against the user's standing context to pre-compute dense summaries and likely future answers, so test-time latency and cost drop when the user actually asks.
- complements Large Reasoning Model (LRM) Paradigm ★: Route reasoning-heavy tasks to a reasoning-tuned model that trades inference time for deliberation, rather than to a fast LLM that exhibits premature closure.