Methodology · Deployment & Operations

Production Failure-Mode Optimization

Find and fix what is wrong in a live multi-agent system by going down a named failure-mode checklist and making one targeted change per mode.

Description

Tune a live multi-agent system by working through a named checklist of common failure modes and applying one targeted fix to each mode you confirm. The checklist covers modes such as vague instructions, a model that is too small, a prompt that does not suit the model, weak tools, no rule for when to stop, the wrong design, no memory, no self-checking, no tests, and no path to hand off to a human. This turns a vague complaint that 'the system is underperforming' into a list of specific changes, each of which you can test and then keep or discard. It treats each failure mode as a chance to improve, not as an incident to firefight. The checklist is short on purpose: it covers the failures that explain most poor performance and does not try to list every possible one.
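As a rough illustration, the checklist can be held as a small enumeration, with one check function per mode that inspects traces and reports evidence. The sketch below is an assumption, not part of the methodology itself: the trace fields (`id`, `steps`, `finished`), the `max_steps` threshold, and the names `Finding`, `check_no_stop_rule`, and `walk_checklist` are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional


class FailureMode(Enum):
    # The modes named in the description above.
    VAGUE_INSTRUCTIONS = "vague instructions"
    MODEL_TOO_SMALL = "model too small"
    PROMPT_MODEL_MISMATCH = "prompt does not suit the model"
    WEAK_TOOLS = "weak tools"
    NO_STOP_RULE = "no rule for when to stop"
    WRONG_DESIGN = "wrong design"
    NO_MEMORY = "no memory"
    NO_SELF_CHECKING = "no self-checking"
    NO_TESTS = "no tests"
    NO_HUMAN_HANDOFF = "no path to hand off to a human"


@dataclass
class Finding:
    mode: FailureMode
    confirmed: Optional[bool]  # None means the mode was not checked
    evidence: List[str]        # IDs of traces that show the problem


def check_no_stop_rule(traces: List[dict], max_steps: int = 20) -> Finding:
    # Hypothetical heuristic: a run that hits the step budget without
    # finishing suggests the agents have no rule for when to stop.
    offenders = [t["id"] for t in traces
                 if t["steps"] >= max_steps and not t["finished"]]
    return Finding(FailureMode.NO_STOP_RULE, bool(offenders), offenders)


def walk_checklist(
    traces: List[dict],
    checks: Dict[FailureMode, Callable[[List[dict]], Finding]],
) -> List[Finding]:
    # Visit every mode in order so that none is silently skipped.
    findings = []
    for mode in FailureMode:
        check = checks.get(mode)
        findings.append(check(traces) if check else Finding(mode, None, []))
    return findings
```

Recording unchecked modes as `None` rather than `False` keeps "not yet inspected" distinct from "inspected and ruled out", which matters when the walk is spread over several sessions.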

When to apply

Use this when a multi-agent system is live, performing below expectation, and the team needs a structured way to find what is broken instead of guessing. The system must be observable enough that you can inspect traces for evidence of each mode. Don't apply it before launch as a design checklist: the modes are for diagnosis, not for building. Build with the right patterns first, then use this to triage problems in the running system. The one exception is a post-mortem of a closed beta that ran with representative traffic.

What it involves

  • Snapshot baseline performance
  • Walk the checklist
  • Rank confirmed modes by expected leverage
  • Apply one intervention per cycle
  • Measure, accept, or revert
  • Re-baseline and continue
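The steps above can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not a definitive implementation: `Intervention`, `optimize`, `measure`, `min_gain`, and the toy `config` dictionary are hypothetical names, the leverage figures are the team's own estimates, and `measure` stands in for whatever evaluation the team already trusts.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Intervention:
    name: str
    expected_leverage: float      # the team's estimate, used only for ranking
    apply: Callable[[], None]     # make the targeted change
    revert: Callable[[], None]    # undo it if it does not help


def optimize(
    interventions: List[Intervention],
    measure: Callable[[], float],
    min_gain: float = 0.0,
) -> Tuple[float, List[Tuple[str, str, float]]]:
    baseline = measure()  # snapshot baseline performance
    log = []
    # Rank confirmed modes by expected leverage, highest first.
    ranked = sorted(interventions,
                    key=lambda iv: iv.expected_leverage, reverse=True)
    for iv in ranked:
        iv.apply()                      # one intervention per cycle
        score = measure()
        if score - baseline > min_gain:
            baseline = score            # accept and re-baseline
            log.append((iv.name, "accepted", score))
        else:
            iv.revert()                 # revert and keep the old baseline
            log.append((iv.name, "reverted", score))
    return baseline, log


# Toy system: the score depends on two configuration flags.
config = {"stop_rule": False, "bigger_model": False}


def measure() -> int:
    return 50 + 20 * config["stop_rule"] - 10 * config["bigger_model"]


candidates = [
    Intervention("bigger_model", 0.3,
                 lambda: config.update(bigger_model=True),
                 lambda: config.update(bigger_model=False)),
    Intervention("add_stop_rule", 0.8,
                 lambda: config.update(stop_rule=True),
                 lambda: config.update(stop_rule=False)),
]

final_score, log = optimize(candidates, measure)
```

Applying a single change per cycle is what makes each measurement attributable: if two changes went in together, a gain or a regression could not be assigned to either one.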

