Production Failure-Mode Optimization
also known as failure-mode checklist, multi-agent optimization checklist
Tune a live multi-agent system by going down a named checklist of common failure modes and fixing each one with a targeted change. The checklist covers things like vague instructions, a model that is too small, a prompt that does not suit the model, weak tools, no rule for when to stop, the wrong design, no memory, no self-checking, no tests, and no path to hand off to a human. This turns a vague 'the system is underperforming' into a list of specific hypotheses, each paired with a change you can test and then keep or revert. It treats each failure mode as a chance to improve, not as an incident to firefight. The checklist is short on purpose. It covers the failures that explain most poor performance, and does not try to list every possible one.
Methodology process overview
Intent. Find and fix what is wrong in a live multi-agent system by going down a named failure-mode checklist and making one targeted change per mode.
When to apply. Use this when a multi-agent system is live, performing below expectation, and the team needs a structured way to find what is broken instead of guessing. The system must be observable enough to inspect traces for each mode. Don't apply it before launch as a design checklist: the modes are for diagnosis, not for building. Build with the right patterns first, then use this to triage problems in the running system. The one exception is a post-mortem of a closed beta that saw representative traffic, because that beta already produced the traces the checklist needs.
Inputs
- Deployed multi-agent system with traces — A running system where you can inspect each agent's traces, tool calls, and outputs per session.
- Failure-mode checklist — The named list: vague instructions, wrong model size, prompt-model mismatch, weak tools, no stop criterion, wrong pattern, no memory, no metacognition, no evals, no human-delegation.
- Performance baseline — A snapshot of the metrics before you change anything, taken per agent and per task, so you can credit each fix to its change.
Outputs
- Mode-by-mode diagnosis — For each checklist mode, evidence from the traces of whether it applies and how strongly.
- Targeted interventions — One fix per confirmed mode: clearer instructions, a model swap, an extra tool, a stop rule, a design change, added memory, a self-check loop, a test, or a human handoff gate.
- Post-intervention metric report — The change in performance from each fix, used to keep or undo it.
Steps (6)
Snapshot baseline performance
Capture the current metrics per agent and per task. Without a baseline, you cannot attribute a later metric change to a specific fix.
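A baseline snapshot can be as simple as grouping trace records by agent and task and aggregating a few metrics. The sketch below is illustrative only: the `TraceRecord` fields, the metric names, and the sample records are all hypothetical, and a real system would read these from its own tracing backend.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TraceRecord:
    """One finished session for one agent (hypothetical trace schema)."""
    agent: str
    task: str
    success: bool
    tool_calls: int


def snapshot_baseline(records):
    """Aggregate metrics per (agent, task) pair."""
    groups = defaultdict(list)
    for record in records:
        groups[(record.agent, record.task)].append(record)
    return {
        key: {
            "sessions": len(group),
            "success_rate": mean(1.0 if r.success else 0.0 for r in group),
            "avg_tool_calls": mean(r.tool_calls for r in group),
        }
        for key, group in groups.items()
    }


# Made-up sample data, for illustration only.
records = [
    TraceRecord("planner", "triage", True, 3),
    TraceRecord("planner", "triage", False, 9),
    TraceRecord("coder", "patch", True, 5),
]
baseline = snapshot_baseline(records)
```

Keying the snapshot by agent and task is what later lets a metric shift be traced to the one agent or task a fix was aimed at.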
Walk the checklist
For each named mode, inspect the traces and outputs to decide whether it applies. Mark each one confirmed, ruled-out, or uncertain. Uncertain modes need more telemetry before you act.
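One way to record the walk is a three-state status per mode, derived from how many inspected sessions show the mode. This is a sketch under stated assumptions: the thresholds `min_sessions` and `confirm_rate` are arbitrary illustrative values, and the evidence counts are invented. Deciding whether a given trace actually exhibits a mode still takes human or automated inspection, which is not shown here.

```python
from enum import Enum


class Status(Enum):
    CONFIRMED = "confirmed"
    RULED_OUT = "ruled-out"
    UNCERTAIN = "uncertain"


# The ten named modes from the checklist.
CHECKLIST = [
    "vague instructions", "wrong model size", "prompt-model mismatch",
    "weak tools", "no stop criterion", "wrong pattern", "no memory",
    "no metacognition", "no evals", "no human-delegation",
]


def classify(hits, inspected, min_sessions=20, confirm_rate=0.10):
    """Turn trace evidence into a status (thresholds are assumptions)."""
    if inspected == 0 or inspected < min_sessions:
        return Status.UNCERTAIN  # too little telemetry to decide
    if hits / inspected >= confirm_rate:
        return Status.CONFIRMED
    return Status.RULED_OUT


def walk_checklist(evidence):
    """evidence maps mode -> (sessions showing the mode, sessions inspected)."""
    return {mode: classify(*evidence.get(mode, (0, 0))) for mode in CHECKLIST}


# Made-up evidence counts, for illustration only.
evidence = {
    "no stop criterion": (12, 40),
    "weak tools": (1, 40),
    "no memory": (2, 5),
}
diagnosis = walk_checklist(evidence)
needs_telemetry = [m for m, s in diagnosis.items() if s is Status.UNCERTAIN]
```

Modes with no evidence at all default to uncertain, not ruled out, so a mode nobody inspected is never silently cleared.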
Rank confirmed modes by expected leverage
Some modes usually explain the biggest gaps, such as vague instructions, a missing stop rule, and the wrong design. Rank the confirmed modes so the first fix has the highest expected payoff.
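Ranking needs some estimate of leverage. The source does not prescribe a formula, so the sketch below assumes a simple one: the share of failing sessions that show the mode, multiplied by the team's guess at how much of that share a fix would resolve. Both fields and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfirmedMode:
    name: str
    affected_share: float  # fraction of failing sessions showing the mode
    est_fix_rate: float    # team's estimate of how much a fix would resolve

    @property
    def expected_leverage(self):
        # Assumed scoring rule, not taken from the source.
        return self.affected_share * self.est_fix_rate


def rank_by_leverage(modes):
    """Highest expected payoff first."""
    return sorted(modes, key=lambda m: m.expected_leverage, reverse=True)


# Made-up estimates, for illustration only.
confirmed = [
    ConfirmedMode("vague instructions", affected_share=0.40, est_fix_rate=0.5),
    ConfirmedMode("no stop criterion", affected_share=0.30, est_fix_rate=0.8),
    ConfirmedMode("no memory", affected_share=0.10, est_fix_rate=0.5),
]
ranked = rank_by_leverage(confirmed)
order = [m.name for m in ranked]
```

In this example the missing stop rule ranks first even though vague instructions affect more sessions, because the estimated fix rate is higher.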
Apply one intervention per cycle
Make a single targeted change per cycle, such as clearer instructions, a bigger model, an extra tool, a stop rule, a design change, added memory, a self-check loop, tests, or a human handoff gate. Change one thing at a time, so the metric shift has a single clear cause.
Measure, accept, or revert
Re-run the eval set or a production sample, and compare the result against the baseline. If the target metric improved and no other tracked metric got worse, keep the change. If not, undo it and move to the next ranked mode.
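The keep-or-revert rule can be written as a small function over two metric snapshots. This is a minimal sketch, assuming each metric has a known direction and that a single absolute `tolerance` is good enough to separate noise from a real shift. In practice metrics on different scales would need their own tolerances. The metric names and values are invented.

```python
def decide(before, after, target, higher_is_better, tolerance=0.01):
    """Return ("keep" | "revert", list of metrics that regressed)."""

    def gain(metric):
        # Positive means the metric moved in its good direction.
        delta = after[metric] - before[metric]
        return delta if higher_is_better[metric] else -delta

    improved = gain(target) > tolerance
    regressed = [m for m in before if m != target and gain(m) < -tolerance]
    verdict = "keep" if improved and not regressed else "revert"
    return verdict, regressed


# Hypothetical metrics and their good direction.
direction = {"success_rate": True, "avg_tool_calls": False, "p95_latency_s": False}
before = {"success_rate": 0.62, "avg_tool_calls": 9.0, "p95_latency_s": 14.0}

# Fix A: target improved, nothing else got worse.
after_a = {"success_rate": 0.71, "avg_tool_calls": 6.5, "p95_latency_s": 14.0}
verdict_a, regressed_a = decide(before, after_a, "success_rate", direction)

# Fix B: target improved, but latency got worse.
after_b = {"success_rate": 0.71, "avg_tool_calls": 9.0, "p95_latency_s": 19.0}
verdict_b, regressed_b = decide(before, after_b, "success_rate", direction)
```

Returning the list of regressed metrics alongside the verdict keeps a record of why a change was reverted.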
Re-baseline and continue
After you keep a fix, snapshot the new baseline before the next cycle. The checklist is not one-shot. Modes can come back as the system grows.
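Put together, the cycle is a loop that applies one fix, measures, keeps or reverts, and re-baselines after every kept fix. The sketch below is a toy: `measure` is a stand-in for re-running an eval set, the configuration keys are hypothetical, and it tracks a single score for brevity, so the check that no other metric got worse is omitted.

```python
def measure(config):
    """Toy stand-in for an eval run: tasks solved out of 100."""
    score = 50
    if config.get("stop_rule"):
        score += 20
    if config.get("clear_instructions"):
        score += 10
    if config.get("model") == "large":
        score -= 5  # pretend the bigger model did not help here
    return score


def run_cycles(config, ranked_fixes, measure_fn):
    """Apply one fix per cycle; re-baseline only after a kept fix."""
    baseline = measure_fn(config)
    history = []
    for name, fix in ranked_fixes:
        candidate = fix(dict(config))  # work on a copy so revert is free
        score = measure_fn(candidate)
        kept = score > baseline
        history.append((name, baseline, score, kept))
        if kept:
            config, baseline = candidate, score  # new baseline for next cycle
    return config, baseline, history


# Hypothetical fixes, already ranked by expected leverage.
ranked_fixes = [
    ("no stop criterion", lambda c: {**c, "stop_rule": True}),
    ("wrong model size", lambda c: {**c, "model": "large"}),
    ("vague instructions", lambda c: {**c, "clear_instructions": True}),
]
final_config, final_score, history = run_cycles(
    {"model": "small"}, ranked_fixes, measure
)
```

Because each fix is measured against the most recent kept baseline, the third fix here is compared with the score after the first fix, not with the original snapshot.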
Principles
- Failure modes are chances to improve, not incidents. Name them so you can act on them.
- Diagnose from the traces before you change code. Guesses about which mode applies are usually wrong.
- One change per cycle. Changing several things at once makes the metric shift impossible to read.
- The checklist is finite on purpose. It covers the modes that explain most poor performance, and accepts that it is not exhaustive.
Known failure modes (2)
Related patterns (4)
- ★★ Eval Harness
Run a held-out dataset against agent versions to detect regressions and measure improvement.
- ★★ Human-in-the-Loop
Require explicit human approval at defined points before the agent performs an action.
- ★★ Step Budget
Cap the number of tool calls or loop iterations the agent is allowed within a single request.
- · Reflexive Metacognitive Agent
The agent maintains an explicit self-model of its own capabilities, confidence, and limitations, and reasons over that model when accepting, refusing, or handing off tasks.
Sources (3)
Designing Multi-Agent Systems
Ch 11 'Optimizing Multi-Agent Systems' — Failure Modes as Optimization Opportunities “optimization strategies for the 10 common failure modes (Chapter 11)”
10 Reasons Your Multi-Agent Workflows Fail and What You Can Do about It (Victor Dibia, InfoQ)
Companion talk for Ch 11 of 'Designing Multi-Agent Systems' “Your agent lacks detailed instructions ... Stop using small models ... Your agent instructions don't match your LLM ... Your agents lack good tools ... Your agents don't know when to stop ... You have the wrong multi-agent pattern ... Your…”
Vibe Coding... With Engineering Discipline (Victor Dibia newsletter)
Author summary of book Ch 11 “optimization strategies for the 10 common failure modes (Chapter 11)”
Provenance
- Added to catalog:
- Last updated:
- Verification status: verified