Methodology · Deployment & Operations · emerging · verified

Production Failure-Mode Optimization

Also known as: failure-mode checklist, multi-agent optimization checklist

Applies to: multi-agent-system, agent

Tags: failure-modes, multi-agent-optimization, checklist, post-deployment

Tune a live multi-agent system by going down a named checklist of common failure modes and fixing each one with a targeted change. The checklist covers things like vague instructions, a model that is too small, a prompt that does not suit the model, weak tools, no rule for when to stop, the wrong design, no memory, no self-checking, no tests, and no path to hand off to a human. This turns a vague 'the system is underperforming' into a list of specific changes you can test and prove or disprove. It treats each failure mode as a chance to improve, not as an incident to firefight. The checklist is short on purpose. It covers the failures that explain most poor performance, and does not try to list every possible one.

Intent. Find and fix what is wrong in a live multi-agent system by going down a named failure-mode checklist and making one targeted change per mode.

When to apply. Use this when a multi-agent system is live, performing below expectation, and the team needs a structured way to find what is broken instead of guessing. The system must be observable enough to inspect traces for each mode. Don't apply it before launch as a design checklist. The modes are for diagnosis, not for building. Build with the right patterns first, then use this to triage problems in the running system. One exception: a post-mortem of a closed beta with representative traffic.

Inputs

  • Deployed multi-agent system with traces: A running system where you can inspect each agent's traces, tool calls, and outputs per session.
  • Failure-mode checklist: The named list: vague instructions, wrong model size, prompt-model mismatch, weak tools, no stop criterion, wrong pattern, no memory, no metacognition, no evals, no human-delegation.
  • Performance baseline: A snapshot of the metrics before you change anything, taken per agent and per task, so you can credit each fix to its change.
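
As a rough illustration of the checklist and baseline inputs, the sketch below encodes the ten named modes as an enum and averages a score per (agent, task) pair. The record shape (`agent`, `task`, `score` keys) and the function name `snapshot_baseline` are assumptions made for this example, not part of the methodology.

```python
from collections import defaultdict
from enum import Enum
from statistics import mean


class FailureMode(Enum):
    """The ten named modes from the checklist."""
    VAGUE_INSTRUCTIONS = "vague instructions"
    WRONG_MODEL_SIZE = "wrong model size"
    PROMPT_MODEL_MISMATCH = "prompt-model mismatch"
    WEAK_TOOLS = "weak tools"
    NO_STOP_CRITERION = "no stop criterion"
    WRONG_PATTERN = "wrong pattern"
    NO_MEMORY = "no memory"
    NO_METACOGNITION = "no metacognition"
    NO_EVALS = "no evals"
    NO_HUMAN_DELEGATION = "no human-delegation"


def snapshot_baseline(runs):
    """Average the score for each (agent, task) pair.

    `runs` is a hypothetical list of dicts with keys "agent", "task", "score".
    """
    grouped = defaultdict(list)
    for run in runs:
        grouped[(run["agent"], run["task"])].append(run["score"])
    return {key: mean(scores) for key, scores in grouped.items()}


# Example records; the agent and task names are made up.
runs = [
    {"agent": "planner", "task": "triage", "score": 1.0},
    {"agent": "planner", "task": "triage", "score": 0.0},
    {"agent": "writer", "task": "summary", "score": 1.0},
]
baseline = snapshot_baseline(runs)
```

Keying the snapshot by agent and task, rather than keeping one global number, is what lets a later metric shift be credited to the agent that was changed.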

Outputs

  • Mode-by-mode diagnosis: For each checklist mode, evidence from the traces of whether it applies and how strongly.
  • Targeted interventions: One fix per confirmed mode: clearer instructions, a model swap, an extra tool, a stop rule, a design change, added memory, a self-check loop, a test, or a human handoff gate.
  • Post-intervention metric report: The change in performance from each fix, used to keep or undo it.
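
One possible shape for the diagnosis and intervention outputs is sketched below. The mode-to-fix pairing is an assumed reading of the two lists above. The text names no fix for prompt-model mismatch, so the sketch leaves that mode for a manual decision. `Diagnosis`, `Status`, and `plan` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    CONFIRMED = "confirmed"
    RULED_OUT = "ruled-out"
    UNCERTAIN = "uncertain"


@dataclass
class Diagnosis:
    mode: str        # checklist mode name
    status: Status   # result of inspecting the traces
    evidence: str    # free-text pointer to the supporting traces (assumed format)


# Assumed pairing of modes to fixes; the source lists both but does not map them.
INTERVENTIONS = {
    "vague instructions": "clearer instructions",
    "wrong model size": "model swap",
    "weak tools": "extra tool",
    "no stop criterion": "stop rule",
    "wrong pattern": "design change",
    "no memory": "added memory",
    "no metacognition": "self-check loop",
    "no evals": "test",
    "no human-delegation": "human handoff gate",
}


def plan(diagnoses):
    """Split diagnoses into fixes to try and modes that need more telemetry."""
    fixes, needs_telemetry = [], []
    for d in diagnoses:
        if d.status is Status.CONFIRMED:
            fixes.append((d.mode, INTERVENTIONS.get(d.mode, "decide manually")))
        elif d.status is Status.UNCERTAIN:
            needs_telemetry.append(d.mode)
        # Ruled-out modes are dropped from both lists.
    return fixes, needs_telemetry
```

Calling `plan` on a list of `Diagnosis` records returns the confirmed modes with their candidate fix, plus the uncertain modes that should get more instrumentation before anyone acts on them.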

Steps (6)

  1. Snapshot baseline performance

    Capture the current metrics per agent and per task. Without a baseline, you cannot credit a fix to its change.

  2. Walk the checklist

    For each named mode, inspect the traces and outputs to decide whether it applies. Mark each one confirmed, ruled-out, or uncertain. Uncertain modes need more telemetry before you act.

  3. Rank confirmed modes by expected leverage

    Some modes usually explain the biggest gaps, such as vague instructions, a missing stop rule, and the wrong design. Rank the confirmed modes so the first fix has the highest expected payoff.

  4. Apply one intervention per cycle

    Make a single targeted change per cycle, such as clearer instructions, a bigger model, an extra tool, a stop rule, a design change, added memory, a self-check loop, tests, or a human handoff gate. Change one thing at a time, so the metric shift has a single clear cause.

  5. Measure, accept, or revert

    Re-run the same tests or production sample you used for the baseline. If the target metric improved and nothing else got worse, keep the change. If not, undo it and move to the next ranked mode.

  6. Re-baseline and continue

    After you keep a fix, snapshot the new baseline before the next cycle. The checklist is not one-shot. Modes can come back as the system grows.
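
Steps 3 to 6 can be sketched as a small loop. This is a minimal sketch under stated assumptions: every metric is "higher is better", `leverage` is the team's own estimate, and `apply`, `revert`, and `measure` are hypothetical callables supplied by whoever operates the system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]  # metric name -> value; higher is better (assumption)


@dataclass
class Intervention:
    mode: str                   # confirmed checklist mode this change targets
    leverage: float             # estimated payoff, used only for ranking
    apply: Callable[[], None]   # makes the single change
    revert: Callable[[], None]  # undoes it


def improved(before: Metrics, after: Metrics, target: str, tol: float = 0.0) -> bool:
    """True if the target metric rose and no other metric fell by more than tol."""
    if after[target] <= before[target]:
        return False
    return all(after[m] >= before[m] - tol for m in before if m != target)


def run_cycles(
    interventions: List[Intervention],
    measure: Callable[[], Metrics],
    target: str,
) -> List[str]:
    """Apply one intervention per cycle; keep it or revert it based on metrics."""
    kept = []
    baseline = measure()  # baseline snapshot before any change
    ranked = sorted(interventions, key=lambda i: i.leverage, reverse=True)
    for item in ranked:
        item.apply()      # exactly one change in this cycle
        after = measure()
        if improved(baseline, after, target):
            kept.append(item.mode)
            baseline = after  # re-baseline before the next cycle
        else:
            item.revert()     # undo and move to the next ranked mode
    return kept
```

Because each cycle compares against the most recently accepted baseline, a later fix is judged on its own effect and not on gains from earlier ones.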

Principles

  • Failure modes are chances to improve, not incidents. Name them so you can act on them.
  • Diagnose from the traces before you change code. Guesses about which mode applies are usually wrong.
  • One change per cycle. Changing several things at once makes the metric shift impossible to read.
  • The checklist is finite on purpose. It covers the modes that explain most poor performance, and accepts it is not exhaustive.

Provenance

  • Added to catalog:
  • Last updated:
  • Verification status: verified