Compound Error Degradation
also known as Per-Step Accuracy Collapse, Multiplicative Error, Long-Horizon Error Compounding
Anti-pattern: deploy a long-horizon agent without modelling that per-step accuracy multiplies across the trajectory.
Context
A team has measured that the underlying model resolves single isolated tool calls or sub-tasks at a respectable per-step success rate — say 95%. They scale the agent up to a 20-step or 100-step pipeline (research loops, code-agent sessions, autonomous browser flows), assuming aggregate quality will track per-step quality.
Problem
Per-step success multiplies across an agent's trajectory: when every step must succeed and step failures are roughly independent, end-to-end success is the product of the per-step rates. A 95%-per-step pipeline is at roughly 60% after 10 steps, about 36% after 20, and about 0.6% after 100. The end-to-end task success the user actually experiences therefore falls off a cliff that the per-step benchmark hid. Teams ship long-horizon agents whose per-step traces look healthy in evaluation but whose realised end-to-end task success on production traffic is unworkable, and the cause is never observable from any single step. The fix is not a marginally better single step; it is fewer steps, better step-level recovery, or a much stronger per-step model.
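A minimal Python sketch of this arithmetic, assuming every step must succeed and that step failures are independent (a simplification; real trajectories can have correlated failures). The function name `end_to_end_success` is hypothetical and chosen only for illustration.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` steps succeed.

    Assumption: each step succeeds independently with probability `per_step`.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


# Trajectory lengths mentioned in the text, at 95% per step.
for n in (1, 10, 20, 100):
    print(f"{n:>3} steps at 95% per step: {end_to_end_success(0.95, n):.1%}")
```

Plotting or tabulating this curve for the agent's real step-count distribution, rather than a single average, gives a more honest picture of where the cliff sits.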
Forces
- Per-step benchmarks make the model look good while end-to-end task success collapses.
- Longer horizons amplify any per-step error; doubling the step count squares the end-to-end success rate (60% becomes about 36%).
- Adding recovery (verifier, retry, checkpoint) raises the effective per-step success above the raw model's rate.
- Cutting the step count by fusing or pre-computing actions often has more impact than an incremental improvement to the model.
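The recovery and step-cutting forces can be compared with a small sketch. It rests on strong assumptions that are not from the source: a verifier that always detects a failed step, and retries that succeed independently at the raw per-step rate. Under those assumptions, effective per-step success with k attempts is 1 - (1 - p)^k. The helper names and the 20-step "fused" scenario are hypothetical; the 92% and 30-step figures are borrowed from the example in the next section.

```python
def effective_per_step(raw: float, attempts: int) -> float:
    """Per-step success when a failed step is detected and retried.

    Assumptions: a perfect verifier and independent attempts, so the
    result is an optimistic upper bound for a real verifier.
    """
    if not 0.0 <= raw <= 1.0:
        raise ValueError("raw must be a probability in [0, 1]")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - (1.0 - raw) ** attempts


def end_to_end(per_step: float, steps: int) -> float:
    """Product of independent, identical per-step success rates."""
    return per_step ** steps


# Illustrative scenarios only; the step counts after fusing are assumed.
scenarios = {
    "baseline: 30 steps, no retry": end_to_end(0.92, 30),
    "fused to 20 steps, no retry": end_to_end(0.92, 20),
    "30 steps, one verified retry": end_to_end(effective_per_step(0.92, 2), 30),
}
for label, value in scenarios.items():
    print(f"{label:<32} {value:.1%}")
```

Because real verifiers miss some failures and retries are rarely independent, the retry row should be read as a best case, not a forecast.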
Example
A team benchmarks their planner-executor at 92% per-step accuracy on a curated tool-call dataset and ships it on a workflow that averages 30 steps. Realised end-to-end task completion comes back at ~8%. The team assumed step quality would carry the pipeline; the multiplicative product (0.92^30 ≈ 0.082) was the actual ceiling. They cut step count by fusing common tool pairs and add a verifier that lifts effective per-step success to ~98%; aggregate rises to ~55%.
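The same product can be inverted to back out the per-step rate implied by an observed end-to-end rate, which is a quick sanity check on figures like those in this example. This is a sketch under the independence assumption; `implied_per_step` is a hypothetical helper, and using 30 steps for the post-fix case is an assumption, since the example does not say how many steps remained after fusing.

```python
def implied_per_step(end_to_end: float, steps: int) -> float:
    """Per-step success rate that would produce `end_to_end` over `steps`.

    Assumption: identical, independent steps, so end_to_end = p ** steps.
    """
    if not 0.0 <= end_to_end <= 1.0:
        raise ValueError("end_to_end must be a probability in [0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return end_to_end ** (1.0 / steps)


# Observed ~8% completion over 30 steps, then ~55% after the fixes.
print(f"before: {implied_per_step(0.08, 30):.3f}")
print(f"after:  {implied_per_step(0.55, 30):.3f}")
```

If the implied rate from production sits well below the benchmarked rate, the benchmark is likely unrepresentative of production steps, not just compounded.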
Solution
Therefore:
Model end-to-end task success as the product of per-step successes (after any per-step recovery). Either cap the step count so the product clears the user-visible success bar, or raise effective per-step success with verifiers, retries, and intermediate checkpoints. Treat raw per-step accuracy on a benchmark as a ceiling, not a forecast.
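Capping the step count can be made concrete by solving p^n >= target for n. The sketch below assumes identical, independent steps; `max_steps` and the 80% success bar are hypothetical values for illustration, not figures from the pattern.

```python
import math


def max_steps(per_step: float, target: float) -> int:
    """Largest step count n with per_step ** n >= target.

    Assumption: identical, independent steps with effective success
    rate `per_step` (after any per-step recovery).
    """
    if not 0.0 < per_step < 1.0:
        raise ValueError("per_step must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    n = math.floor(math.log(target) / math.log(per_step))
    # Guard against floating-point rounding at the boundary.
    while n > 0 and per_step ** n < target:
        n -= 1
    while per_step ** (n + 1) >= target:
        n += 1
    return n


# Step budget for an assumed 80% end-to-end success bar.
for p in (0.92, 0.95, 0.98):
    print(f"per-step {p:.0%}: at most {max_steps(p, 0.80)} steps")
```

The resulting number is a natural input to a step budget: if the workflow cannot fit inside it, the effective per-step rate has to rise before the horizon can.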
What this pattern forbids. Per-step accuracy on a benchmark must not be used as a forecast of end-to-end agent success; the product over the trajectory bounds what the agent can deliver.
And the patterns that stand alongside it, or against it —
- complements Step Budget ★★: Cap the number of tool calls or loop iterations the agent is allowed within a single request.
- alternative-to Tool Transition Fusion ·: Mine tool-call telemetry for high-probability X-then-Y transitions and fuse those pairs into a single composite tool, shrinking the planner's step count.
- complements Evaluator-Optimizer ★★: One LLM generates; another evaluates and feeds back; loop until criteria are met.