Reasoning

Trajectory-Summary Test-Time Scaling

When an agent's outputs are extended action-observation trajectories rather than short answers, scale test-time compute by compressing each rollout into a structured summary and selecting or reusing across those summaries instead of raw traces.

Problem

Best-of-N and self-consistency assume the candidates are short outputs that a reward model can rank or a vote can aggregate. A hundred-step trajectory is neither: its salient content — the hypotheses tried, the progress made, the dead ends — is buried in low-signal trace detail, so naive ranking compares noise and naive voting has nothing to count. Without a comparable representation, extra rollouts add cost but not a reliable way to pick or combine the best one.

Solution

Run several rollouts of the long task, and convert each into a structured summary that keeps its salient hypotheses, progress, and failure modes while shedding low-signal trace detail. Scale in two directions over these summaries. For parallel scaling, compare summaries against each other — for example by recursive tournament voting — to select the strongest attempt without ever diffing raw traces. For sequential scaling, feed the summaries of earlier rollouts back in as conditioning so a fresh rollout starts from what previous attempts learned. The summary, not the trajectory, is the object that gets ranked, voted on, and carried forward, which makes long-horizon outputs comparable at a fraction of their token cost.
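The two scaling directions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the `Summary` fields, the `tournament_select` and `build_conditioning` helpers, and the `toy_judge` (which reads a made-up `score_hint` instead of calling a model) are all hypothetical names introduced here. In practice the judge would be a model comparing summaries, and the summaries would come from a summariser run over each raw trajectory.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Summary:
    """Structured summary of one rollout (field names are illustrative)."""
    hypotheses: List[str]
    progress: str
    failures: List[str]
    score_hint: float = 0.0  # hypothetical; only read by the toy judge below


# A judge takes a small group of summaries and returns the index of the winner.
Judge = Callable[[Sequence[Summary]], int]


def tournament_select(summaries: Sequence[Summary], judge: Judge,
                      group_size: int = 2) -> Summary:
    """Parallel scaling: judge small groups, advance winners, recurse."""
    if not summaries:
        raise ValueError("need at least one summary")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    pool = list(summaries)
    if len(pool) == 1:
        return pool[0]
    winners = []
    for start in range(0, len(pool), group_size):
        group = pool[start:start + group_size]
        winners.append(group[judge(group)])
    return tournament_select(winners, judge, group_size)


def build_conditioning(task: str, prior: Sequence[Summary]) -> str:
    """Sequential scaling: render earlier summaries as context for a new rollout."""
    lines = [f"Task: {task}"]
    for n, s in enumerate(prior, start=1):
        lines.append(f"Attempt {n}: {s.progress}")
        lines.append(f"  tried: {'; '.join(s.hypotheses)}")
        lines.append(f"  failed: {'; '.join(s.failures)}")
    return "\n".join(lines)


def toy_judge(group: Sequence[Summary]) -> int:
    # Placeholder for a model-based comparison of the summaries in the group.
    return max(range(len(group)), key=lambda i: group[i].score_hint)


summaries = [
    Summary(["stale cache"], "stuck at setup", ["missing dependency"], 0.1),
    Summary(["race in worker"], "reproduced the bug", ["patch broke tests"], 0.7),
    Summary(["bad config"], "ruled out config", ["no reproduction"], 0.4),
]
best = tournament_select(summaries, toy_judge)
prompt = build_conditioning("fix the flaky test", summaries)
```

Only summaries pass through the judge and the conditioning prompt, so the cost of comparison and reuse depends on summary length rather than trajectory length. The group size is a tunable assumption: larger groups mean fewer rounds but ask the judge to compare more summaries at once.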

When to use

The task is long-horizon, so each attempt is a trajectory too long to compare or vote on verbatim.
Extra inference compute is available and is expected to raise the success rate, provided the good attempts can be identified.
A reliable summariser can capture an attempt's hypotheses, progress, and failures.
