Test-Time Compute Scaling

also known as Inference-Time Scaling, Compute-Time Trade-Off

Allocate more inference-time compute (samples, search, deeper thinking) instead of scaling parameters to improve quality.

Context

A team has hit a quality ceiling on a hard workload (math benchmarks, code reasoning, complex planning), and the obvious move of waiting for a larger next-generation model is either unavailable or too expensive. They have inference budget to spend, and they have noticed that some classes of problem respond well to spending more compute at answer time rather than at training time.

Problem

A single-pass call to even a strong model underuses the compute available at inference time. The team knows several inference-time techniques exist (drawing many samples and picking the best, voting across many samples, searching over reasoning trees, allocating more internal reasoning tokens), but each technique shines on a different kind of task. Without a deliberate policy for how to spend inference budget per task class, the team leaves easy quality gains on the table and overpays on items that would not have benefited.

Forces

Wall-clock latency rises with compute.
Cost rises linearly or worse with sample count.
The best technique (sampling, search, or deeper thinking) is task-dependent.
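
The cost force can be made concrete with a back-of-envelope model. The sketch below assumes samples are independent, each is correct with probability p, and a perfect verifier selects a correct sample whenever one exists; correlated samples or an imperfect verifier will do worse. The function names and the value of p are illustrative, not taken from the pattern.

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


def marginal_gain(p: float, n: int) -> float:
    """Extra coverage bought by adding the n-th sample."""
    return coverage(p, n) - coverage(p, n - 1)


if __name__ == "__main__":
    p = 0.3  # assumed per-sample success rate (illustrative)
    for n in (1, 2, 4, 8, 16):
        # Cost grows linearly with n; the gain per extra sample shrinks.
        print(f"n={n:2d}  cost={n}x  coverage={coverage(p, n):.3f}  "
              f"gain from last sample={marginal_gain(p, n):.3f}")
```

Under these assumptions cost grows linearly in the sample count while each extra sample buys less, which is why the budget is worth tuning per task class rather than set once.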

Example

A team has a hard math benchmark where their current model underperforms; the obvious move is to wait for a larger model. Instead they apply test-time compute scaling: best-of-N sampling with a verifier for verifier-amenable items, self-consistency for sampling-amenable items, tree search for combinatorial items, extended thinking for sequential reasoning. Per-item cost rises but accuracy on the benchmark beats the next-tier model at lower total cost.

Diagram

flowchart TD
    Q[Request] --> Class{Task class?}
    Class -->|verifier-amenable| BoN[Best-of-N]
    Class -->|sampling-amenable| SC[Self-consistency]
    Class -->|combinatorial| Tree[Tree search]
    Class -->|sequential| ET[Extended thinking]
    BoN --> Comp[Compose where complementary]
    SC --> Comp
    Tree --> Comp
    ET --> Comp
    Comp --> Out[Answer at tuned compute budget]

Solution

Therefore:

Pick the inference-time technique that fits: best-of-N for verifier-amenable tasks, self-consistency for sampling-amenable tasks, tree search for combinatorial tasks, extended thinking for sequential reasoning. Compose techniques where complementary. Tune the compute budget per task class.
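
A minimal sketch of the routing idea, covering only two of the four techniques. It assumes a hypothetical `sample` callable (prompt in, one sampled answer out) and a hypothetical `score` callable standing in for a verifier; neither corresponds to a real library API, and the per-class budgets are placeholders to be tuned.

```python
from collections import Counter
from typing import Callable

Sampler = Callable[[str], str]        # hypothetical: prompt -> one sampled answer
Scorer = Callable[[str, str], float]  # hypothetical: (prompt, answer) -> score


def best_of_n(prompt: str, sample: Sampler, score: Scorer, n: int) -> str:
    """Draw n candidates and keep the one the verifier scores highest."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda ans: score(prompt, ans))


def self_consistency(prompt: str, sample: Sampler, n: int) -> str:
    """Draw n candidates and return the most common answer."""
    votes = Counter(sample(prompt) for _ in range(n))
    return votes.most_common(1)[0][0]


# Per-class sample counts; placeholder values, to be tuned per workload.
BUDGET = {"verifier-amenable": 8, "sampling-amenable": 5}


def answer(prompt: str, task_class: str, sample: Sampler, score: Scorer) -> str:
    """Route a request to the technique configured for its task class."""
    if task_class == "verifier-amenable":
        return best_of_n(prompt, sample, score, BUDGET[task_class])
    if task_class == "sampling-amenable":
        return self_consistency(prompt, sample, BUDGET[task_class])
    # Tree search and extended thinking are left out of this sketch;
    # unknown classes fall back to a single pass.
    return sample(prompt)
```

Keeping the budget in a table keyed by task class makes the tuning step explicit: changing how much a class may spend does not touch the technique code.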

What this pattern forbids. Unbounded inference spend: each request must specify its compute budget, and requests that exceed it are cut off.
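
One way to enforce such a cutoff is to charge an estimated cost before every sample and stop at the first charge that would overshoot. The sketch below assumes budgets are counted in tokens and that a per-sample estimate is available; `ComputeBudget`, `BudgetExceeded`, and `sample_within_budget` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable


class BudgetExceeded(Exception):
    """Raised when a charge would push spend past the declared budget."""


@dataclass
class ComputeBudget:
    max_tokens: int        # declared by the request up front
    spent_tokens: int = 0  # running total of charged tokens

    def charge(self, tokens: int) -> None:
        """Record spend; refuse any charge that would exceed the budget."""
        if self.spent_tokens + tokens > self.max_tokens:
            raise BudgetExceeded(
                f"would spend {self.spent_tokens + tokens} of {self.max_tokens} tokens"
            )
        self.spent_tokens += tokens


def sample_within_budget(sample: Callable[[str], str], prompt: str,
                         budget: ComputeBudget, n: int,
                         tokens_per_sample: int) -> list[str]:
    """Draw up to n samples, stopping early once the budget runs out."""
    answers: list[str] = []
    for _ in range(n):
        try:
            budget.charge(tokens_per_sample)  # charge before spending
        except BudgetExceeded:
            break  # cut off: return whatever was collected so far
        answers.append(sample(prompt))
    return answers
```

Charging before sampling means the declared budget is never exceeded, at the price of relying on an estimate; charging actual usage afterwards is more accurate but can overshoot by one sample.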

The smaller patterns that complete this one:

generalises Extended Thinking ★★: Spend a configurable budget of internal reasoning tokens before producing a user-visible answer.
generalises Best-of-N Sampling ★: Sample N candidate outputs and select the highest-ranked by a reward model or scorer.
generalises Self-Consistency ★★: Sample the same question multiple times at non-zero temperature and aggregate by majority or judge to mitigate hallucination.
generalises Language Agent Tree Search ·: Lift the agent loop into a search tree with a learned value function and backtracking.
generalises Process Reward Model ★: Train a verifier that scores each reasoning step rather than only the final answer.
generalises Adaptive Branching Tree Search ·: At each node of an inference-time search tree, use Thompson sampling to decide whether to deepen an existing answer or branch a fresh attempt, optionally choosing per-node which underlying LLM to invoke.
generalises Adaptive Compute Allocation ★: Allocate inference-time compute (thinking tokens, samples, depth, model size) per query based on input difficulty, rather than using a fixed budget across all queries.
generalises Trajectory-Summary Test-Time Scaling ·: When an agent's outputs are extended action-observation trajectories rather than short answers, scale test-time compute by compressing each rollout into a structured summary and selecting or reusing across those summaries instead of raw traces.
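
As a rough illustration of how per-query allocation can work, the sketch below keeps sampling only while the vote is still close, so easy questions stop early and contested ones use the full cap. It is a simple heuristic chosen for illustration, not the method of any pattern listed above; `adaptive_vote`, `margin`, and the default values are assumptions.

```python
from collections import Counter
from typing import Callable


def adaptive_vote(sample: Callable[[str], str], prompt: str,
                  min_samples: int = 3, max_samples: int = 15,
                  margin: int = 3) -> tuple[str, int]:
    """Sample until the top answer leads by `margin` votes or the cap is hit.

    Returns (answer, samples_used).
    """
    votes: Counter = Counter()
    used = 0
    for used in range(1, max_samples + 1):
        votes[sample(prompt)] += 1
        if used < min_samples:
            continue  # always collect a minimum number of votes
        ranked = votes.most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if ranked[0][1] - runner_up >= margin:
            break  # the vote is decisive enough; stop spending
    return votes.most_common(1)[0][0], used
```

With the defaults, a sampler that always returns the same answer stops after three samples, while one whose answers keep splitting runs to `max_samples`.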

And the patterns that stand alongside it, or against it:

alternative-to Sleep-Time Compute ·: During idle or downtime, run the model offline against the user's standing context to pre-compute dense summaries and likely future answers, so test-time latency and cost drop when the user actually asks.
complements Large Reasoning Model (LRM) Paradigm ★: Route reasoning-heavy tasks to a reasoning-tuned model that trades inference time for deliberation, rather than to a fast LLM that exhibits premature closure.

Used in frameworks

OpenAI Agents SDK
core · 13 patterns · Agent SDKs · ★★ mature
OpenAI documents that o-series quality keeps climbing with more inference-time reasoning — the same more-compute-equals-better trend extended into RL and test-time thinking — expo…

Provenance

Source: patterns/test-time-compute-scaling.md on GitHub · commit 4fa1213
Added to catalog: 2026-04-30
Last updated: 2026-05-21
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.