Trajectory-Summary Test-Time Scaling

also known as Rollout-Summary Scaling, Recursive Tournament Voting

When an agent's outputs are extended action-observation trajectories rather than short answers, scale test-time compute by compressing each rollout into a structured summary and selecting or reusing across those summaries instead of raw traces.

This pattern helps complete certain larger patterns —

specialises Test-Time Compute Scaling ★★: Allocate more inference-time compute (samples, search, deeper thinking) instead of scaling parameters to improve quality.

Context

Test-time scaling improves quality by spending more inference compute — sampling several attempts, voting, or searching — and it works cleanly when each attempt is a short, directly comparable answer. Long-horizon agents break that assumption: a single attempt is a trajectory of dozens of tool calls and observations, far too long to compare verbatim and too noisy to majority-vote token by token. There is still good reason to spend more compute to raise success rates on these long tasks.

Problem

Best-of-N and self-consistency assume the candidates are short outputs that a reward model can rank or a vote can aggregate. A hundred-step trajectory is neither: its salient content — the hypotheses tried, the progress made, the dead ends — is buried in low-signal trace detail, so naive ranking compares noise and naive voting has nothing to count. Without a comparable representation, extra rollouts add cost but not a reliable way to pick or combine the best one.

Forces

Spending more rollouts raises the chance one of them succeeds, but only if there is a reliable way to identify or combine the good ones.
A trajectory's decision-relevant content is a small fraction of its tokens, so comparing or voting over raw traces drowns the signal.
Compressing a rollout to a summary risks discarding the very detail that distinguishes a good attempt from a plausible-looking bad one.
Sequential reuse (re-rolling conditioned on prior summaries) and parallel selection (voting across summaries) need the same summary representation but spend compute differently.

Example

A coding agent gets eight attempts at a hard multi-file bug. Each attempt is about eighty tool calls long, so the attempts cannot be compared directly. The system summarises each attempt in a line such as 'tried X, fixed the import but tests still fail on the date parser' and runs a tournament over the summaries. The winner, the one whose summary shows passing tests, is returned, and its summary can also seed a ninth attempt that starts where the winner left off.
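As a minimal sketch of what one such summary might look like as data, the record below keeps the three kinds of content the pattern names (hypotheses, progress, failure modes) plus a test status. The class name `RolloutSummary`, its fields, and the `render` format are illustrative assumptions, not a schema taken from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutSummary:
    """Compact, comparable record of one long rollout (hypothetical schema)."""
    rollout_id: int
    hypotheses: tuple[str, ...]     # what the attempt tried
    progress: tuple[str, ...]       # what it actually fixed
    failure_modes: tuple[str, ...]  # where it is still stuck
    tests_passed: bool

    def render(self) -> str:
        """One-line form that a judge or a later rollout can read."""
        status = "tests pass" if self.tests_passed else "tests still fail"
        parts = [f"tried {', '.join(self.hypotheses)}"]
        if self.progress:
            parts.append(f"fixed {', '.join(self.progress)}")
        if self.failure_modes:
            parts.append(f"stuck on {', '.join(self.failure_modes)}")
        return f"rollout {self.rollout_id}: " + "; ".join(parts) + f" ({status})"


# The summary quoted in the example, expressed as a record.
example = RolloutSummary(
    rollout_id=3,
    hypotheses=("X",),
    progress=("the import",),
    failure_modes=("the date parser",),
    tests_passed=False,
)
print(example.render())
```

In a real system the fields would be filled in by a summarising model reading the eighty-step trace; the point here is only that the comparable object is a few short fields rather than the trace itself.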

Diagram

flowchart TD
    T[Long task] --> R1[Rollout 1]
    T --> R2[Rollout 2]
    T --> R3[Rollout N]
    R1 --> S1[Summary 1]
    R2 --> S2[Summary 2]
    R3 --> S3[Summary N]
    S1 --> V[Tournament vote over summaries]
    S2 --> V
    S3 --> V
    V --> W[Selected result]

Solution

Therefore:

Run several rollouts of the long task, and convert each into a structured summary that keeps its salient hypotheses, progress, and failure modes while shedding low-signal trace detail. Scale in two directions over these summaries. For parallel scaling, compare summaries against each other — for example by recursive tournament voting — to select the strongest attempt without ever diffing raw traces. For sequential scaling, feed the summaries of earlier rollouts back in as conditioning so a fresh rollout starts from what previous attempts learned. The summary, not the trajectory, is the object that gets ranked, voted on, and carried forward, which makes long-horizon outputs comparable at a fraction of their token cost.
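The two scaling directions can be sketched in a few lines of Python. This is one plausible reading of recursive tournament voting, assumed here rather than taken from the source: split the summaries into small groups, let a judge pick one winner per group, and repeat on the winners until one remains. The names `recursive_tournament`, `followup_prompt`, and `toy_judge` are hypothetical, and the judge is a stand-in for what would normally be a model call.

```python
from typing import Callable, List, Sequence

# Hypothetical judge signature: given a small group of summaries, return the
# position (within that group) of the strongest one.
Judge = Callable[[Sequence[str]], int]


def recursive_tournament(summaries: Sequence[str], judge: Judge,
                         group_size: int = 2) -> int:
    """Parallel scaling: return the index of the winning summary."""
    if not summaries:
        raise ValueError("need at least one summary")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    alive: List[int] = list(range(len(summaries)))
    while len(alive) > 1:
        winners: List[int] = []
        for start in range(0, len(alive), group_size):
            group = alive[start:start + group_size]
            if len(group) == 1:
                winners.append(group[0])  # odd one out gets a bye
                continue
            pick = judge([summaries[i] for i in group])
            winners.append(group[pick])
        alive = winners  # recurse on the winners of this round
    return alive[0]


def followup_prompt(task: str, prior_summaries: Sequence[str]) -> str:
    """Sequential scaling: condition a fresh rollout on earlier summaries."""
    lines = [f"Task: {task}", "Earlier attempts (summaries only):"]
    lines += [f"- {s}" for s in prior_summaries]
    lines.append("Continue from what these attempts learned; avoid their dead ends.")
    return "\n".join(lines)


def toy_judge(group: Sequence[str]) -> int:
    """Stand-in judge: prefer a summary that reports passing tests."""
    for pos, text in enumerate(group):
        if "tests pass" in text:
            return pos
    return 0


summaries = [
    "a: tests still fail on the date parser",
    "b: tests still fail on the import",
    "c: tests pass",
    "d: tests still fail on the date parser",
    "e: tests still fail on the fixture",
]
best = recursive_tournament(summaries, toy_judge)
print(summaries[best])
print(followup_prompt("fix the multi-file bug", summaries))
```

Both functions take only summary strings, so the judge's input stays small no matter how long the underlying rollouts were. With groups of two, N summaries need N - 1 judge calls in total.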

What this pattern forbids. Test-time selection and reuse operate only on the structured rollout summaries, never on the raw trajectories directly; a rollout cannot be ranked, voted on, or carried forward until it has been compressed into the summary representation.
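One way to make this constraint hard to violate by accident is to give raw trajectories and summaries distinct types and have the selection entry point reject anything that is not a summary. The sketch below assumes hypothetical `Trajectory` and `Summary` classes and a hypothetical `require_summaries` guard; it illustrates the rule and is not an implementation from the source.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Trajectory:
    """Raw action-observation trace of one rollout (hypothetical type)."""
    steps: tuple[str, ...]


@dataclass(frozen=True)
class Summary:
    """Compressed representation of one rollout (hypothetical type)."""
    text: str


def require_summaries(candidates: Iterable[object]) -> List[Summary]:
    """Guard for selection or reuse: accept summaries, reject raw traces."""
    checked: List[Summary] = []
    for candidate in candidates:
        if not isinstance(candidate, Summary):
            raise TypeError(
                f"{type(candidate).__name__} must be summarised before "
                "it can be ranked, voted on, or carried forward"
            )
        checked.append(candidate)
    return checked


ok = require_summaries([Summary("fixed the import; tests pass")])
print(len(ok))

try:
    require_summaries([Trajectory(steps=("open file", "run tests"))])
except TypeError as err:
    print(err)
```

Python type hints alone are not enforced at runtime, which is why the guard uses an explicit `isinstance` check.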

And the patterns that stand alongside it, or against it —

alternative-to Best-of-N Sampling ★: Sample N candidate outputs and select the highest-ranked by a reward model or scorer.
alternative-to Self-Consistency ★★: Sample the same question multiple times at non-zero temperature and aggregate by majority or judge to mitigate hallucination.
complements Episodic Summaries ★★: Compress past episodes into summaries that preserve gist while shedding token cost.
complements Reasoning Trace Carry-Forward ★: For reasoning models that emit a separate reasoning trace, preserve that trace in context across the same logical task episode (across tool-call/result turns) but drop it at user-turn boundaries.

References

Scaling Test-Time Compute for Agentic Coding (paper)

Provenance

Source: patterns/rollout-summary-test-time-scaling.md on GitHub · commit 7012173
Added to catalog: 2026-06-17
Last updated: 2026-06-17
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.