Methodology · Deployment & Operations · emerging · verified

Production Failure-Mode Optimization

Also known as: failure-mode checklist, multi-agent optimization checklist

Applies to: multi-agent-system, agent

Tags: failure-modes, multi-agent-optimization, checklist, post-deployment

Tune a live multi-agent system by working through a named checklist of common failure modes and fixing each confirmed one with a targeted change. The checklist covers vague instructions, a model that is too small, a prompt that does not suit the model, weak tools, no rule for when to stop, the wrong design, no memory, no self-checking, no tests, and no path to hand off to a human. This turns a vague 'the system is underperforming' into a list of specific hypotheses, each paired with a change you can test and then keep or revert. It treats each failure mode as a chance to improve, not as an incident to firefight. The checklist is short on purpose: it covers the failures that explain most poor performance and does not try to list every possible one.

Methodology process overview

flowchart TD
    mas[Deployed multi-agent system] --> s1[Snapshot baseline performance]
    s1 --> base[Per-agent / per-task metric baseline]
    checklist[Failure-mode checklist] --> s2[Walk the checklist against traces]
    traces[Traces, tool calls, outputs] --> s2
    s2 --> labeled[Each mode: confirmed / ruled-out / uncertain]
    labeled --> uncertain{Uncertain modes?}
    uncertain -->|yes| more_tel[Collect more telemetry]
    more_tel --> s2
    uncertain -->|no| s3[Rank confirmed modes by leverage]
    s3 --> ranked[Ranked intervention list]
    ranked --> s4[Apply ONE intervention per cycle]
    s4 --> intervention[Single targeted change]
    intervention --> s5[Measure delta vs baseline]
    s5 --> accept{Targeted metric up, no regression?}
    accept -->|yes| s6[Re-baseline]
    accept -->|no| revert[Revert change]
    revert --> s4
    s6 --> done{More modes to address?}
    done -->|yes| s4
    done -->|no| stable[System optimized for this round]

Intent. Find and fix what is wrong in a live multi-agent system by going down a named failure-mode checklist and making one targeted change per mode.

When to apply. Use this when a multi-agent system is live, performing below expectation, and the team needs a structured way to find what is broken instead of guessing. The system must be observable enough to inspect traces for each mode. Don't apply it before launch as a design checklist. The modes are for diagnosis, not for building. Build with the right patterns first, then use this to triage problems in the running system. One exception: a post-mortem of a closed beta with representative traffic.

Example scenario

A retail-ops team owns a four-agent product-content system: a research agent gathers product specs from supplier feeds, a copy agent drafts marketing descriptions, a compliance agent checks claims against regulator-allowed language, and a coordinator agent assembles the final SKU page. Quality has been drifting quietly: internal reviewers report that 'the descriptions feel off', and the compliance agent flags more than it should. Nobody can point to a specific bug, so the team applies the checklist.

Baseline snapshot: per-agent success rates and per-SKU end-to-end pass rate over the last 30 days.

Checklist walk:
Vague instructions: confirmed for the research agent (its prompt says 'gather useful spec info' with no schema), ruled out for the copy agent, uncertain for the coordinator.
Wrong model size: confirmed for the copy agent (it runs on the smallest model, and tone failures cluster there).
Prompt-model mismatch: ruled out.
Weak tools: confirmed. The research agent has only one supplier-feed tool, and products from EU suppliers need a second one.
No stop criterion: confirmed for the coordinator, which loops on compliance disagreements indefinitely.
No metacognition: confirmed across the board.
No evals: partially confirmed (only the end-to-end pass rate is measured; there are no per-agent evals).

Ranking: vague instructions and the missing stop criterion typically explain the largest gaps and are cheap to fix, so they go first.

Cycle 1: rewrite the research-agent prompt with a strict schema. The cycle's target metric (research-agent success rate) rises from 71% to 89%. Accept.
Cycle 2: add a 5-turn stop criterion plus escalate-to-human on the coordinator. End-to-end pass rate rises from 64% to 78%. Accept.
Cycle 3: upgrade the copy agent to a stronger model. Tone-failure rate drops. Accept.
Cycle 4: add the EU supplier tool. It brings a partial improvement but introduces a latency regression on non-EU SKUs. Revert and shelve for a follow-up cycle.

Five cycles in, the end-to-end pass rate is 87%, with each gain attributed to a specific change. The team did not add more tools to chase 'weak tools', aware that tool explosion is its own checklist failure mode.
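The per-cycle attribution in the scenario comes down to a before/after difference in percentage points. A minimal sketch, using the two cycles for which the scenario gives both numbers; the record layout and the `delta_points` helper are illustrative assumptions, not part of the methodology:

```python
# Figures come from the example scenario; the dict layout is hypothetical.
cycles = [
    {"cycle": 1, "metric": "research-agent success rate", "before": 0.71, "after": 0.89},
    {"cycle": 2, "metric": "end-to-end pass rate", "before": 0.64, "after": 0.78},
]


def delta_points(before: float, after: float) -> float:
    """Change in a rate, expressed in percentage points (rounded to 0.1)."""
    return round((after - before) * 100, 1)


for c in cycles:
    gain = delta_points(c["before"], c["after"])
    print(f"Cycle {c['cycle']}: {c['metric']} {gain:+.1f} points")
```

Because each cycle contains exactly one change, each delta can be credited to that change alone.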

Inputs

Deployed multi-agent system with traces — A running system where you can inspect each agent's traces, tool calls, and outputs per session.
Failure-mode checklist — The named list: vague instructions, wrong model size, prompt-model mismatch, weak tools, no stop criterion, wrong pattern, no memory, no metacognition, no evals, no human-delegation.
Performance baseline — A snapshot of the metrics before you change anything, taken per agent and per task, so you can credit each fix to its change.
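One way to make these inputs concrete is to encode the checklist as an enumeration and the baseline as a small record. This is only a sketch: the mode names are taken from the list above, while the class names `FailureMode` and `Baseline` and their fields are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    """The ten named checklist modes."""
    VAGUE_INSTRUCTIONS = "vague instructions"
    WRONG_MODEL_SIZE = "wrong model size"
    PROMPT_MODEL_MISMATCH = "prompt-model mismatch"
    WEAK_TOOLS = "weak tools"
    NO_STOP_CRITERION = "no stop criterion"
    WRONG_PATTERN = "wrong pattern"
    NO_MEMORY = "no memory"
    NO_METACOGNITION = "no metacognition"
    NO_EVALS = "no evals"
    NO_HUMAN_DELEGATION = "no human-delegation"


@dataclass(frozen=True)
class Baseline:
    """Hypothetical snapshot: metric values keyed by (agent, task)."""
    window_days: int
    metrics: dict  # e.g. {("research", "gather_specs"): 0.71}


baseline = Baseline(window_days=30, metrics={("research", "gather_specs"): 0.71})
print(len(FailureMode), "modes;", baseline.window_days, "day window")
```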

Outputs

Mode-by-mode diagnosis — For each checklist mode, evidence from the traces of whether it applies and how strongly.
Targeted interventions — One fix per confirmed mode: clearer instructions, a model swap, an extra tool, a stop rule, a design change, added memory, a self-check loop, a test, or a human handoff gate.
Post-intervention metric report — The change in performance from each fix, used to keep or undo it.

Steps (6)

Snapshot baseline performance
Capture the current metrics per agent and per task. Without a baseline, you cannot credit a fix to its change.
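A baseline of this kind could be computed directly from trace records. The sketch below assumes each record carries `agent`, `task`, and `success` fields; that schema and the `snapshot_baseline` name are hypothetical, and a real system would read its own trace format and metrics.

```python
from collections import defaultdict


def snapshot_baseline(records):
    """Return {(agent, task): success_rate} from an iterable of trace records."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for record in records:
        key = (record["agent"], record["task"])
        totals[key] += 1
        if record["success"]:
            successes[key] += 1
    return {key: successes[key] / totals[key] for key in totals}


# Illustrative records, not real data.
records = [
    {"agent": "research", "task": "gather_specs", "success": True},
    {"agent": "research", "task": "gather_specs", "success": False},
    {"agent": "copy", "task": "draft_description", "success": True},
]
baseline = snapshot_baseline(records)
print(baseline)
```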
Walk the checklist
For each named mode, inspect the traces and outputs to decide whether it applies. Mark each one confirmed, ruled-out, or uncertain. Uncertain modes need more telemetry before you act.
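The three labels and the "uncertain means collect more telemetry" rule can be sketched as a small triage function. The `Status` enum and `triage` helper are illustrative names; the labels themselves would still come from a human or tool reading the traces.

```python
from enum import Enum


class Status(Enum):
    CONFIRMED = "confirmed"
    RULED_OUT = "ruled-out"
    UNCERTAIN = "uncertain"


def triage(diagnosis):
    """Split a {mode: Status} diagnosis into actionable modes and modes needing telemetry."""
    confirmed = [mode for mode, status in diagnosis.items() if status is Status.CONFIRMED]
    needs_telemetry = [mode for mode, status in diagnosis.items() if status is Status.UNCERTAIN]
    return confirmed, needs_telemetry


# Illustrative labels only.
diagnosis = {
    "vague instructions": Status.CONFIRMED,
    "prompt-model mismatch": Status.RULED_OUT,
    "no memory": Status.UNCERTAIN,
}
confirmed, needs_telemetry = triage(diagnosis)
print("act on:", confirmed, "| collect telemetry for:", needs_telemetry)
```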
Rank confirmed modes by expected leverage
Some modes usually explain the biggest gaps, such as vague instructions, a missing stop rule, and the wrong design. Rank the confirmed modes so the first fix has the highest expected payoff.
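The methodology does not prescribe a scoring formula for leverage. One possible sketch, assuming the team estimates an expected gain and a relative cost for each confirmed mode, is to sort by gain per unit cost. The numbers below are made up for illustration, and `cost` is assumed to be positive.

```python
def rank_by_leverage(candidates):
    """Sort confirmed modes so the highest expected gain per unit cost comes first."""
    return sorted(candidates, key=lambda c: c["expected_gain"] / c["cost"], reverse=True)


# Hypothetical estimates: expected_gain in metric points, cost in relative effort.
candidates = [
    {"mode": "wrong model size", "expected_gain": 0.08, "cost": 3.0},
    {"mode": "vague instructions", "expected_gain": 0.15, "cost": 1.0},
    {"mode": "no stop criterion", "expected_gain": 0.12, "cost": 1.0},
]
ranked = [c["mode"] for c in rank_by_leverage(candidates)]
print(ranked)
```

Any scoring rule works as long as it is applied consistently; the point is that the order of fixes is decided before the first change is made.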
Apply one intervention per cycle
Make a single targeted change per cycle, such as clearer instructions, a bigger model, an extra tool, a stop rule, a design change, added memory, a self-check loop, tests, or a human handoff gate. Change one thing at a time, so the metric shift has a single clear cause.
Measure, accept, or revert
Re-run the evals or measure a fresh production sample. If the target metric improved and no other metric got worse, keep the change. If not, revert it and move to the next ranked mode.
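The keep-or-revert rule can be written as a single decision function. This sketch assumes every metric is higher-is-better and that both snapshots share the same keys; the `tolerance` parameter (noise allowance for non-target metrics) and the example values for the non-target metric are assumptions, not part of the methodology.

```python
def decide(baseline, candidate, target, tolerance=0.01):
    """Return "accept" if the target metric improved and no other metric regressed."""
    improved = candidate[target] > baseline[target]
    regressed = [
        name
        for name, value in baseline.items()
        if name != target and candidate[name] < value - tolerance
    ]
    return "accept" if improved and not regressed else "revert"


baseline = {"research_success": 0.71, "e2e_pass": 0.64}

# Target improves, nothing else moves: keep the change.
kept = decide(baseline, {"research_success": 0.89, "e2e_pass": 0.64}, "research_success")

# Target improves but another metric drops beyond tolerance: undo the change.
undone = decide(baseline, {"research_success": 0.75, "e2e_pass": 0.60}, "research_success")

print(kept, undone)
```

A lower-is-better metric such as latency would need to be negated or inverted before being passed in.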
Re-baseline and continue
After you keep a fix, snapshot the new baseline before the next cycle. The checklist is not one-shot. Modes can come back as the system grows.
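Taken together, the last three steps form a loop: apply one change, measure, keep or revert, and re-baseline after every kept change. A minimal sketch follows, run against a simulated system. The callables `apply_fix`, `measure`, `revert_fix`, and `target_for`, and the `effects` table, are hypothetical stand-ins for real deployment and eval tooling, and all metrics are assumed to be higher-is-better.

```python
def optimize(ranked_modes, baseline, apply_fix, measure, revert_fix, target_for):
    """Run one intervention per cycle; keep it only if it helps without regressions."""
    log = []
    for mode in ranked_modes:
        apply_fix(mode)  # exactly one change in this cycle
        candidate = measure()
        target = target_for(mode)
        improved = candidate[target] > baseline[target]
        regressed = any(candidate[k] < baseline[k] for k in baseline if k != target)
        if improved and not regressed:
            baseline = candidate  # re-baseline before the next cycle
            log.append((mode, "accepted"))
        else:
            revert_fix(mode)
            log.append((mode, "reverted"))
    return baseline, log


# Simulated system: each applied fix shifts the metrics by a made-up amount.
applied = set()
effects = {
    "vague instructions": {"quality": 0.10, "speed": 0.0},
    "weak tools": {"quality": 0.05, "speed": -0.20},
}


def measure():
    metrics = {"quality": 0.60, "speed": 0.90}
    for mode in applied:
        for name, change in effects[mode].items():
            metrics[name] = round(metrics[name] + change, 2)
    return metrics


final, log = optimize(
    ranked_modes=["vague instructions", "weak tools"],
    baseline=measure(),
    apply_fix=applied.add,
    measure=measure,
    revert_fix=applied.discard,
    target_for=lambda mode: "quality",
)
print(log, final)
```

In this simulation the second fix improves its target but regresses another metric, so it is reverted and the baseline stays at the values recorded after the first fix.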

Principles

Failure modes are chances to improve, not incidents. Name them so you can act on them.
Diagnose from the traces before you change code. Guesses about which mode applies are usually wrong.
One change per cycle. Changing several things at once makes the metric shift impossible to read.
The checklist is finite on purpose. It covers the modes that explain most poor performance, and accepts it is not exhaustive.
