Trajectory-Summary Test-Time Scaling
When an agent's outputs are extended action-observation trajectories rather than short answers, scale test-time compute by compressing each rollout into a structured summary and selecting or reusing across those summaries instead of raw traces.
Problem
Best-of-N and self-consistency assume the candidates are short outputs that a reward model can rank or a vote can aggregate. A hundred-step trajectory is neither: its salient content (the hypotheses tried, the progress made, the dead ends) is buried in low-signal trace detail, so naive ranking compares noise and naive voting has nothing to count. Without a comparable representation, extra rollouts add cost without providing a reliable way to pick the best attempt or to combine several.
Solution
Run several rollouts of the long task, and convert each into a structured summary that keeps its salient hypotheses, progress, and failure modes while shedding low-signal trace detail. Scale in two directions over these summaries. For parallel scaling, compare summaries against each other — for example by recursive tournament voting — to select the strongest attempt without ever diffing raw traces. For sequential scaling, feed the summaries of earlier rollouts back in as conditioning so a fresh rollout starts from what previous attempts learned. The summary, not the trajectory, is the object that gets ranked, voted on, and carried forward, which makes long-horizon outputs comparable at a fraction of their token cost.
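The two scaling directions can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Summary` fields, the function names, and the `toy_judge` heuristic are all hypothetical. In practice the judge would typically be a model-based comparison of summaries, and the summaries themselves would come from a summariser that is not shown here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Summary:
    """Structured digest of one rollout (field names are assumptions)."""
    hypotheses: List[str] = field(default_factory=list)
    progress: str = ""
    failures: List[str] = field(default_factory=list)


# A judge receives a group of summaries and returns the index of the strongest.
Judge = Callable[[List[Summary]], int]


def tournament_select(summaries: List[Summary], judge: Judge,
                      group_size: int = 2) -> Summary:
    """Parallel scaling: keep each group's winner, round after round."""
    if not summaries:
        raise ValueError("need at least one summary")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    pool = list(summaries)
    while len(pool) > 1:
        groups = [pool[i:i + group_size] for i in range(0, len(pool), group_size)]
        # A leftover group of one advances without being judged.
        pool = [g[judge(g)] if len(g) > 1 else g[0] for g in groups]
    return pool[0]


def build_conditioning(task: str, prior: List[Summary]) -> str:
    """Sequential scaling: render earlier summaries as context for a new rollout."""
    lines = [f"Task: {task}"]
    for i, s in enumerate(prior, start=1):
        lines.append(f"Attempt {i}:")
        lines.append(f"  hypotheses: {'; '.join(s.hypotheses) or 'none'}")
        lines.append(f"  progress: {s.progress or 'none'}")
        lines.append(f"  failures: {'; '.join(s.failures) or 'none'}")
    return "\n".join(lines)


def toy_judge(group: List[Summary]) -> int:
    """Placeholder heuristic: prefer the summary with the fewest recorded failures."""
    return min(range(len(group)), key=lambda i: len(group[i].failures))


# Illustrative data only; real summaries would be produced from actual rollouts.
attempts = [
    Summary(["cache is stale"], "reproduced the bug", ["patch broke the build"]),
    Summary(["race in writer"], "fix passes locally", []),
    Summary(["bad config"], "ruled out config", ["no repro", "timeout"]),
]
best = tournament_select(attempts, toy_judge)
next_prompt = build_conditioning("make the failing test pass", attempts)
```

Only summaries are passed to the judge and to the next rollout's context, so the cost of comparison and conditioning grows with summary length rather than with raw trajectory length.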
When to use
- The task is long-horizon, so each attempt is a trajectory too long to compare or vote on verbatim.
- Extra inference compute is available and is expected to raise the success rate, provided the good attempts can be identified.
- A reliable summariser can capture an attempt's hypotheses, progress, and failures.