Book X

Governance & Observability

The patterns that let humans trust the agent over time.

38 patterns in this book · Updated 2026-06-19

Top 5 patterns in Governance & Observability by usage


AGENT PATTERNS · BOOK X · GOVERNANCE & OBSERVABILITY


agentpatternscatalog.org

Agent Resumption
a.k.a. Durable Execution · Pause-and-Resume
Persist agent execution state so a long-running run survives restarts, deploys, or user disconnects.
×38 compositions
Eval Harness
a.k.a. Golden Dataset Suite · Champion-Challenger
Run a held-out dataset against agent versions to detect regressions and measure improvement.
×17 compositions
Decision Log
a.k.a. Reasoning Trace · Thought Trace
Persist the agent's reasoning trace alongside its actions so post-hoc review can explain why.
×14 compositions
LLM-as-Judge
a.k.a. Model Grading · Auto-Evaluator
Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
×10 compositions
Replay / Time-Travel
a.k.a. Trace Replay · Run Branching
Re-run a past agent trace from any step with modified inputs/prompts/tools to debug or branch.
×7 compositions

When to reach for each

01. Agent Resumption
Persist agent execution state so a long-running run survives restarts, deploys, or user disconnects.
Best for: Agent runs long enough that a restart, deploy, or disconnect would lose meaningful work.
Tradeoff: Checkpoint storage cost.
Watch for: Runs that complete in seconds and can simply be retried from scratch.
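As a rough illustration of Agent Resumption, the sketch below saves progress to a JSON file after every step and picks up from the last saved step on the next start. The function name `run_with_checkpoints`, the file-based store, and the step signature (each step receives the list of earlier results) are assumptions made for this example, not part of the catalog entry.

```python
import json
import os


def run_with_checkpoints(steps, checkpoint_path):
    """Run steps in order, saving progress after each so a restart can resume.

    steps: list of callables; each receives the list of earlier results.
    checkpoint_path: hypothetical location of the JSON checkpoint file.
    """
    state = {"next_step": 0, "results": []}
    if os.path.exists(checkpoint_path):
        # A previous run left a checkpoint: continue from where it stopped.
        with open(checkpoint_path) as f:
            state = json.load(f)

    for i in range(state["next_step"], len(steps)):
        state["results"].append(steps[i](state["results"]))
        state["next_step"] = i + 1
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state["results"]
```

Results must be JSON-serializable in this sketch, and a step that crashes partway is re-executed on resume, so steps with side effects would need to be idempotent.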

02. Eval Harness
Run a held-out dataset against agent versions to detect regressions and measure improvement.
Best for: Systems where a change that 'feels better' may be silently regressing quality.
Tradeoff: Dataset bias means high scores can hide real-world failures.
Watch for: Tasks with no expected outputs (open-ended creative work), where scoring would be subjective.
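A minimal sketch of an Eval Harness in the champion-challenger style might look like the following. It assumes exact-match scoring, agents modeled as plain callables, and a dataset of `{"input", "expected"}` records; all names here are hypothetical.

```python
def evaluate(agent, dataset):
    """Return the fraction of cases where the agent's output equals the expected output."""
    passed = sum(1 for case in dataset if agent(case["input"]) == case["expected"])
    return passed / len(dataset)


def compare(champion, challenger, dataset, tolerance=0.0):
    """Score both versions on the same held-out dataset and flag a regression.

    tolerance: how far the challenger may fall below the champion
    before the change counts as a regression (an assumed knob).
    """
    champion_score = evaluate(champion, dataset)
    challenger_score = evaluate(challenger, dataset)
    return {
        "champion": champion_score,
        "challenger": challenger_score,
        "regressed": challenger_score < champion_score - tolerance,
    }


# Toy usage: the "agents" are stand-in functions, not real models.
dataset = [
    {"input": "a", "expected": "A"},
    {"input": "hello", "expected": "HELLO"},
]
report = compare(str.upper, lambda s: s.upper() if len(s) < 4 else s, dataset)
```

Keeping the dataset fixed across versions is what makes the two scores comparable; the tradeoff noted above still applies, since a biased dataset yields a biased comparison.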

03. Decision Log
Persist the agent's reasoning trace alongside its actions so post-hoc review can explain why.
Best for: Cases where action-only logs leave you unable to explain why the agent did something.
Tradeoff: Storage and privacy implications.
Watch for: Reasoning logs retained without any review process that consults them.
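One way to picture a Decision Log is an append-only JSON Lines file in which every action is stored together with the reasoning that led to it. The record fields and the function names `log_decision` and `explain` are assumptions for illustration only.

```python
import json
import time


def log_decision(log_path, step, reasoning, action, args):
    """Append one record that pairs an action with the reasoning behind it."""
    record = {
        "ts": time.time(),
        "step": step,
        "reasoning": reasoning,
        "action": action,
        "args": args,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record


def explain(log_path, action):
    """Return the recorded reasoning for every occurrence of an action."""
    with open(log_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    return [r["reasoning"] for r in records if r["action"] == action]
```

Because the reasoning text may contain user data, a real deployment would likely need redaction and a retention policy, which this sketch leaves out.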

04. LLM-as-Judge
Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
Best for: Open-ended outputs that need automated regression detection without a reference answer.
Tradeoff: Judge biases skew scores in subtle ways.
Watch for: Tasks that an exact-match or reference metric already grades.
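The LLM-as-Judge idea can be sketched without committing to any particular model provider. In the example below, `call_llm` is a hypothetical callable that takes a prompt string and returns the model's text reply; the rubric, the 1 to 5 scale, and the `SCORE: <n>` reply format are likewise assumptions chosen for the example.

```python
import re

# Hypothetical rubric; real criteria depend on the task being graded.
RUBRIC = ["Answers the question asked", "Makes no unsupported claims"]


def build_prompt(question, answer, rubric):
    """Assemble a grading prompt that lists the rubric criteria."""
    criteria = "\n".join(f"- {c}" for c in rubric)
    return (
        f"Score the answer from 1 to 5 against these criteria:\n{criteria}\n\n"
        f"Question: {question}\nAnswer: {answer}\n\n"
        "Reply with 'SCORE: <n>'."
    )


def judge(question, answer, call_llm, rubric=RUBRIC):
    """Ask the judge model for a score and parse it strictly."""
    reply = call_llm(build_prompt(question, answer, rubric))
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        # Fail loudly rather than guess a score from a malformed reply.
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group(1))
```

Injecting `call_llm` keeps the judge testable with a stub, and makes it easier to swap judge models when checking for the score bias mentioned in the tradeoff.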

05. Replay / Time-Travel
Re-run a past agent trace from any step with modified inputs/prompts/tools to debug or branch.
Best for: Non-deterministic agent runs whose incidents need reproducible debugging.
Tradeoff: Trace storage overhead.
Watch for: Low-stakes, ephemeral runs where trace storage cost outweighs the value of replay.
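To make Replay / Time-Travel concrete, the sketch below reuses the recorded outputs of a trace up to a chosen step and re-executes everything after it, optionally with a swapped-in tool. The trace layout (a list of `{"tool", "output"}` records) and the convention that each tool receives the previous step's output are simplifying assumptions, not a description of any specific tracing system.

```python
def replay(trace, tools, from_step, overrides=None):
    """Reuse recorded outputs before from_step, then re-execute the rest.

    trace: list of {"tool": name, "output": value} in execution order.
    tools: mapping of tool name -> function taking the previous output.
    overrides: optional mapping of tool name -> replacement function,
               applied only to the re-executed part of the run.
    """
    active = {**tools, **(overrides or {})}
    # Steps before the branch point are not re-run; their recorded
    # outputs stand in for them, which keeps the prefix deterministic.
    outputs = [step["output"] for step in trace[:from_step]]
    for step in trace[from_step:]:
        previous = outputs[-1] if outputs else None
        outputs.append(active[step["tool"]](previous))
    return outputs


# Toy usage with stand-in tools.
tools = {"fetch": lambda _: 2, "double": lambda x: x * 2, "inc": lambda x: x + 1}
trace = [
    {"tool": "fetch", "output": 2},
    {"tool": "double", "output": 4},
    {"tool": "inc", "output": 5},
]
branched = replay(trace, tools, from_step=1, overrides={"double": lambda x: x * 3})
```

Here `branched` diverges from the recorded run at step 1 while the `fetch` result is taken from the trace, which is the property that makes an incident reproducible even when the original call was non-deterministic.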


All patterns in this book