VIII · Safety & ControlMature★★

Exception Handling and Recovery

also known as Error Recovery, Failure Mode Handler

Catch and react to predictable failure modes (tool errors, rate limits, validation failures) with structured recovery paths.

Context

A team runs a production agent that calls many tools in a loop: search APIs, internal databases, third-party services, model endpoints. In real traffic those tools fail in predictable, repeating ways — the API is briefly down, the caller hit a rate limit, the response came back malformed, the credential was rejected, the request timed out. Each of those failure modes wants a different response from the agent.

Problem

If the tool layer returns errors as opaque strings stuffed back into the conversation, the agent treats them as text and reacts with whatever the model invents — sometimes a retry, sometimes a confident hallucinated explanation to the user, sometimes a stall. The agent has no way to branch deterministically on a rate-limit versus a validation error, so it cannot back off correctly on the first or replan on the second. Without typed errors and named recovery branches, the team is forced to choose between blanket retries that mask real bugs and giving up on partial-failure handling altogether.

Forces

  • Recovery logic must not mask bugs.
  • Some errors are user-visible; others should be silent.
  • Retry storms on transient errors.

Example

A research agent calls a search tool that returns a rate-limit error. Without typed handling the error string flows back into the conversation as an opaque blob; the agent invents a plausible-sounding explanation and stalls. The team adds Exception Recovery: each tool wraps known failure modes (rate-limit, auth, validation, timeout) into typed error envelopes, and the agent's prompt has explicit recovery branches — back off and retry on rate-limit, switch tool on validation, escalate on auth. Failures stop becoming silent confusion.

Diagram

Solution

Therefore:

Catalogue failure modes. For each, define: detect (typed error), respond (retry / fall back / surface to user / replan), and log. The agent receives a structured error message and can react with a typed branch in its loop.

What this pattern forbids. Errors must arrive at the agent as typed events from the catalogue; untyped errors are escalated to the operator.

The smaller patterns that complete this one —

  • generalisesGraceful Degradation★★When a dependency fails, downgrade the user-facing experience to a working subset rather than failing entirely.

And the patterns that stand alongside it, or against it —

  • complementsFallback Chain★★Try a primary handler; on failure or low confidence, fall through to a sequence of fallback handlers.
  • complementsCircuit Breaker★★Stop calling a failing dependency for a cooldown period after error rates exceed a threshold.
  • complementsReplan on Failure★★Trigger a fresh planning step when execution evidence contradicts the current plan.
  • complementsMissing Idempotency on Agent CallsAnti-pattern: retry state-mutating agent tool calls without idempotency keys, so retries multiply real-world side effects.

Neighbourhood

Click any neighbour to follow the language. Scroll to zoom, drag to pan.