Compound Error Degradation

also known as Per-Step Accuracy Collapse, Multiplicative Error, Long-Horizon Error Compounding

Anti-pattern: deploy a long-horizon agent without modelling that per-step accuracy multiplies across the trajectory.

Context

A team has measured that the underlying model resolves single isolated tool calls or sub-tasks at a respectable per-step success rate — say 95%. They scale the agent up to a 20-step or 100-step pipeline (research loops, code-agent sessions, autonomous browser flows), assuming aggregate quality will track per-step quality.

Problem

Per-step success multiplies across an agent's trajectory. If each step succeeds independently at 95%, end-to-end success is roughly 60% after 10 steps and about 0.6% after 100. The end-to-end task success the user actually experiences therefore falls off a cliff that the per-step benchmark hid. Teams ship long-horizon agents whose per-step traces look healthy in evaluation but whose realised end-to-end task success on production traffic is unworkable, and the cause is never observable from any single step. The fix is not a marginally better single step; it is fewer steps, better step-level recovery, or a much stronger per-step model.
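The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration, assuming every step succeeds independently with the same probability p (real trajectories may have correlated or uneven step quality); the function name `end_to_end_success` is hypothetical and not part of any library.

```python
def end_to_end_success(p: float, n: int) -> float:
    """Probability that all n steps succeed.

    Assumption: steps are independent and each succeeds with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return p ** n


if __name__ == "__main__":
    # Same per-step rate, increasingly long horizons.
    for steps in (1, 10, 20, 100):
        rate = end_to_end_success(0.95, steps)
        print(f"{steps:>3} steps at 95% per step: {rate:.1%}")
```

Under these assumptions the loop reports roughly 60% at 10 steps, 36% at 20 steps, and 0.6% at 100 steps.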

Forces

Per-step benchmarks make the model look good while end-to-end task success collapses.
Longer horizons amplify any per-step error; doubling the step count squares the end-to-end success rate, since p^(2n) = (p^n)^2.
Adding recovery (verifier, retry, checkpoint) raises the effective per-step success above the raw model's rate.
Cutting the step count by fusing or pre-computing actions can have more impact than an incremental model improvement, because every removed step drops a whole factor of p from the product.
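The recovery force above can be made concrete with a small model. This sketch assumes a verifier that catches a failed attempt with probability `detect`, never rejects a correct attempt, and triggers independent retries up to a fixed limit. Those assumptions, and the name `effective_step_success`, are illustrative and not taken from the source.

```python
def effective_step_success(p: float, detect: float, retries: int) -> float:
    """Per-step success after verifier-triggered retries.

    Assumptions (illustrative only):
      - each attempt succeeds independently with probability p
      - a failed attempt is caught with probability `detect`
      - a caught failure is retried, up to `retries` extra attempts
      - an uncaught failure makes the step fail silently
    """
    caught_failure = (1.0 - p) * detect
    # Success on attempt k+1 requires k caught failures followed by a success.
    return p * sum(caught_failure ** k for k in range(retries + 1))


if __name__ == "__main__":
    raw = 0.92
    lifted = effective_step_success(raw, detect=0.9, retries=2)
    print(f"raw per-step: {raw:.3f}, with verifier and retries: {lifted:.3f}")
    print(f"30-step product, raw:    {raw ** 30:.3f}")
    print(f"30-step product, lifted: {lifted ** 30:.3f}")
```

With `detect=0` the function returns the raw rate unchanged, which matches the intuition that retries only help when failures are actually noticed.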

Example

A team benchmarks their planner-executor at 92% per-step accuracy on a curated tool-call dataset and ships it on a workflow that averages 30 steps. Realised end-to-end task completion comes back at ~8%. The team assumed step quality would carry the pipeline; the multiplicative product (0.92^30 ≈ 0.082) was the actual ceiling. They cut step count by fusing common tool pairs and add a verifier that lifts effective per-step success to ~98%; aggregate rises to ~55% (for comparison, 0.98^30 ≈ 0.55).
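The product figures in this example can be sanity-checked by simulation. The sketch below is a hypothetical Monte Carlo check, assuming independent steps with a fixed success probability; `simulate_completion_rate` is an illustrative name, and the trial count and seed are arbitrary choices.

```python
import random


def simulate_completion_rate(p: float, n_steps: int, trials: int, seed: int = 0) -> float:
    """Fraction of simulated trajectories in which every step succeeds.

    Assumption: each step succeeds independently with probability p.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    completed = 0
    for _ in range(trials):
        # A trajectory completes only if all of its steps succeed.
        if all(rng.random() < p for _ in range(n_steps)):
            completed += 1
    return completed / trials


if __name__ == "__main__":
    for p in (0.92, 0.98):
        simulated = simulate_completion_rate(p, n_steps=30, trials=20_000)
        print(f"p={p}: simulated {simulated:.3f}, closed form {p ** 30:.3f}")
```

The simulated rates should land close to the closed-form values p^30, which is the point of the example: the benchmark number alone says little until it is raised to the power of the trajectory length.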

Diagram

flowchart TD
    P[Per-step success p]
    N[Steps per task n]
    P --> Mult[Aggregate ≈ p^n]
    N --> Mult
    Mult --> Drop{Below bar?}
    Drop -- yes --> Cap[Cap step count]
    Drop -- yes --> Ver[Add per-step verifier]
    Drop -- yes --> Fuse[Fuse step pairs]
    Drop -- no --> Ship[Ship]

Solution

Therefore:

Model end-to-end task success as the product of per-step successes (after any per-step recovery). Either cap the step count so the product clears the user-visible success bar, or raise effective per-step success with verifiers, retries, and intermediate checkpoints. Treat raw per-step accuracy on a benchmark as a ceiling, not a forecast.
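Both levers in this solution can be turned into a quick planning calculation. The following is a sketch under the same independence assumption; `max_steps_for_target` and `required_step_success` are hypothetical helper names, and the success bar of 80% in the demo is an arbitrary illustration.

```python
import math


def max_steps_for_target(p: float, target: float) -> int:
    """Largest step count n such that p**n >= target.

    Assumption: independent steps, each succeeding with probability p.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    n = math.floor(math.log(target) / math.log(p))
    # Correct for floating-point rounding near the boundary.
    while p ** (n + 1) >= target:
        n += 1
    while n > 0 and p ** n < target:
        n -= 1
    return n


def required_step_success(n: int, target: float) -> float:
    """Per-step success needed for an n-step trajectory to reach target."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    return target ** (1.0 / n)


if __name__ == "__main__":
    print("step cap at p=0.95 for an 80% bar:", max_steps_for_target(0.95, 0.80))
    print("per-step success needed for 30 steps at an 80% bar:",
          round(required_step_success(30, 0.80), 4))
```

Here `p` should be the effective per-step success after any recovery, not the raw benchmark figure, since that is the quantity the solution says to multiply.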

What this pattern forbids. Per-step accuracy on a benchmark must not be used as a forecast of end-to-end agent success; the product over the trajectory bounds what the agent can deliver.

The patterns that counter or replace it:

complements Step Budget ★★: Cap the number of tool calls or loop iterations the agent is allowed within a single request.
alternative-to Tool Transition Fusion ·: Mine tool-call telemetry for high-probability X-then-Y transitions and fuse those pairs into a single composite tool, shrinking the planner's step count.
complements Evaluator-Optimizer ★★: One LLM generates; another evaluates and feeds back; loop until criteria are met.
complements Agent Bullwhip Effect ✕: Anti-pattern: distributed supply-chain or replenishment agents, each optimising locally, amplify order variability through their own decision policy, so a local demand spike triggers synchronised chain-wide reordering and supplier stockouts that propagate backward.

Provenance

Source: patterns/compound-error-degradation.md on GitHub · commit 135ae3c · view history
Added to catalog: 2026-05-23
Last updated: 2026-05-23
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.