Compound Error Degradation
Anti-pattern: deploying a long-horizon agent without modelling how per-step accuracy multiplies across the trajectory.
Problem
Per-step success multiplies across an agent's trajectory. If steps fail independently, a pipeline that succeeds 95% of the time per step finishes 10 steps at roughly 60% and 100 steps at about 0.6%. The end-to-end task success the user actually experiences therefore falls off a cliff that the per-step benchmark hid. Teams ship long-horizon agents whose per-step traces look healthy in evaluation but whose realised end-to-end task success on production traffic is unworkable, and the cause is not visible in any single step. The fix is not a marginally better single step: it is fewer steps, better step-level recovery, or a much stronger per-step model.
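The compounding can be checked with a few lines of arithmetic. This is a minimal sketch that assumes every step succeeds independently with the same probability, which real trajectories only approximate; the name `end_to_end_success` is illustrative and not taken from any library.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """End-to-end success when every step succeeds independently with the
    same probability (a simplifying assumption of this sketch)."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


if __name__ == "__main__":
    for n in (1, 10, 100):
        rate = end_to_end_success(0.95, n)
        print(f"{n:>3} steps at 95% per step: {rate:.1%}")
```

Run as a script, this prints about 95%, 60%, and 0.6% for 1, 10, and 100 steps.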
Solution
Model end-to-end task success as the product of per-step successes (after any per-step recovery). Either cap the step count so the product clears the user-visible success bar, or raise effective per-step success with verifiers, retries, and intermediate checkpoints. Treat raw per-step accuracy on a benchmark as a ceiling, not a forecast.
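Both levers can be sketched numerically. The example below is hedged: it assumes independent steps, a verifier that catches every failed step, and retries that succeed independently of the first attempt, all of which are optimistic. The function names `effective_step_success` and `max_steps` are hypothetical helpers, not an existing API.

```python
def effective_step_success(per_step: float, retries: int) -> float:
    """Chance a step eventually succeeds given `retries` extra attempts.

    Assumes the verifier detects every failure and attempts are independent.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if retries < 0:
        raise ValueError("retries must be non-negative")
    return 1.0 - (1.0 - per_step) ** (retries + 1)


def max_steps(per_step: float, target: float) -> int:
    """Largest step count whose product of per-step successes stays >= target."""
    if not (0.0 < per_step < 1.0 and 0.0 < target <= 1.0):
        raise ValueError("need 0 < per_step < 1 and 0 < target <= 1")
    steps, product = 0, 1.0
    # Keep adding steps while the running product still clears the bar.
    while product * per_step >= target:
        product *= per_step
        steps += 1
    return steps


if __name__ == "__main__":
    bar = 0.90  # hypothetical user-visible success bar
    raw = 0.95
    with_retry = effective_step_success(raw, retries=1)
    print(f"raw per-step {raw:.2%}: budget {max_steps(raw, bar)} steps")
    print(f"one verified retry {with_retry:.2%}: budget {max_steps(with_retry, bar)} steps")
```

Under these assumptions, a single verified retry lifts 95% per step to 99.75% and stretches the step budget for a 90% bar from 2 steps to 42. A real verifier misses some failures and retries are often correlated, so the actual gain will be smaller.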
When to use
- Naming this anti-pattern when reviewing a long-horizon agent proposal.
- Per-step benchmarks look healthy but end-to-end success on production traffic does not.
- A long pipeline is being proposed with no step budget and no per-step verifier.