Governance & Observability

The patterns that let humans trust the agent over time.

27 patterns in this book.

When to reach for each

01. Agent Resumption

Persist agent execution state so a long-running run survives restarts, deploys, or user disconnects.

Best for: runs long enough that a restart, deploy, or disconnect would lose meaningful work.

Tradeoff: checkpoint storage cost.

Watch for: runs that complete in seconds and can simply be retried from scratch.

02. Decision Log

Persist the agent's reasoning trace alongside its actions so post-hoc review can explain why.

Best for: systems where action-only logs leave you unable to explain why the agent did something.

Tradeoff: storage and privacy implications.

Watch for: reasoning logs that would be retained without any review process ever consulting them.

03. Eval Harness

Run a held-out dataset against agent versions to detect regressions and measure improvement.

Best for: catching changes that 'feel better' but silently regress quality in your system.

Tradeoff: dataset bias means high scores can hide real-world failures.

Watch for: tasks with no expected outputs (open-ended creative tasks), where scoring would be subjective.

04. Replay / Time-Travel

Re-run a past agent trace from any step with modified inputs, prompts, or tools to debug or branch.

Best for: non-deterministic agent runs where incidents need reproducible debugging.

Tradeoff: trace storage overhead.

Watch for: low-stakes ephemeral runs where trace storage cost outweighs the value of replay.

05. Agent-as-a-Judge

Evaluate an agent's full trajectory (steps, tool calls, intermediate states) by another agent rather than scoring only the final output.

Best for: tasks that succeed or fail along their trajectory in ways the final answer cannot reveal.

Tradeoff: trajectory evaluation is expensive.

Watch for: cases where only the final output is checkable and the trajectory carries no evaluable structure.

All patterns in this book

Agent Resumption

×38

Persist agent execution state so a long-running run survives restarts, deploys, or user disconnects.
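
A minimal sketch of the idea, assuming the run can be modelled as an ordered list of zero-argument step functions with JSON-serialisable results. The function name `run_with_checkpoints` and the checkpoint file layout are illustrative assumptions, not part of the pattern's definition.

```python
import json
import os


def run_with_checkpoints(steps, checkpoint_path):
    """Run `steps` in order, resuming after the last step recorded on disk."""
    # Hypothetical checkpoint layout: index of the next step plus results so far.
    state = {"next_step": 0, "results": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    for i in range(state["next_step"], len(steps)):
        state["results"].append(steps[i]())
        state["next_step"] = i + 1
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a truncated checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state["results"]
```

If the process dies partway through, calling the function again with the same path skips the steps already recorded. A step that crashed before its checkpoint was written is re-executed, so steps with side effects would need to be idempotent.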

Decision Log

×8

Persist the agent's reasoning trace alongside its actions so post-hoc review can explain why.
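
One possible shape for such a log, sketched as append-only JSON Lines. The field names (`ts`, `step`, `reasoning`, `action`) and the helper `log_decision` are assumptions made for illustration.

```python
import json
import time


def log_decision(log_file, step, reasoning, action):
    """Append one record pairing the agent's reasoning with the action it took."""
    record = {
        "ts": time.time(),       # wall-clock timestamp of the decision
        "step": step,            # position of this decision within the run
        "reasoning": reasoning,  # the model's stated rationale, stored verbatim
        "action": action,        # the tool call or output that followed
    }
    log_file.write(json.dumps(record) + "\n")
    return record
```

Keeping reasoning and action in the same record means a reviewer never has to join two logs by timestamp to answer "why did it do that?".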

Eval Harness

×8

Run a held-out dataset against agent versions to detect regressions and measure improvement.
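
A rough sketch of a harness, assuming each agent version is a plain callable and each dataset row has `input` and `expected` keys with exact-match scoring. Real harnesses usually need richer scorers; the names `evaluate` and `regressed` are hypothetical.

```python
def evaluate(agent, dataset):
    """Return the fraction of examples where the agent's output matches exactly."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = sum(1 for ex in dataset if agent(ex["input"]) == ex["expected"])
    return correct / len(dataset)


def regressed(baseline, candidate, dataset, tolerance=0.0):
    """True if the candidate scores worse than the baseline by more than `tolerance`."""
    return evaluate(candidate, dataset) < evaluate(baseline, dataset) - tolerance
```

Running both versions against the same held-out rows is what turns "feels better" into a comparable number.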

Replay / Time-Travel

×7

Re-run a past agent trace from any step with modified inputs/prompts/tools to debug or branch.

Agent-as-a-Judge

×5

Evaluate an agent's full trajectory (steps, tool calls, intermediate states) by another agent rather than scoring only the final output.

Lineage Tracking

×4

Track which prompt version, model version, and data sources produced each agent output.

LLM-as-Judge

×3

Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.

Provenance Ledger

×3

Log every agent decision and state change with enough metadata to explain or reverse it later.

Durable Workflow Snapshot

×3

Capture workflow execution state as a snapshot in a pluggable storage provider so a paused run can resume across deployments, process restarts, and host crashes.

Cost Observability

×2

Surface per-request, per-user, and per-feature cost and token consumption to operators in near-real-time.
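
As an illustration only, the aggregation side might look like the sketch below. It assumes a single flat price per 1,000 tokens, which real providers rarely offer; the class `CostMeter` and its key scheme are hypothetical.

```python
from collections import defaultdict


class CostMeter:
    """Accumulate estimated spend per user and per feature."""

    def __init__(self, usd_per_1k_tokens):
        # Assumed flat rate; substitute your provider's actual pricing model.
        self.rate = usd_per_1k_tokens
        self.totals = defaultdict(float)

    def record(self, user, feature, tokens):
        """Record one request and return its estimated cost in USD."""
        cost = tokens / 1000 * self.rate
        self.totals[("user", user)] += cost
        self.totals[("feature", feature)] += cost
        return cost
```

Per-request cost is the return value of `record`; the per-user and per-feature views come from reading `totals`.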

Shadow Canary

×2

Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.

Agent Middleware Chain

×2

Wrap every model call, tool call, and memory access in a composable pre/execute/post interceptor pipeline so cross-cutting concerns attach without touching agent or orchestrator code.
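
The pre/execute/post pipeline can be sketched with plain function composition. This assumes each call is represented as a dict and each middleware takes `(call, next_handler)`; `build_chain` and `tracing` are illustrative names, not an existing framework API.

```python
def build_chain(middlewares, handler):
    """Wrap `handler` so the first middleware in the list runs outermost."""
    for middleware in reversed(middlewares):
        handler = _wrap(middleware, handler)
    return handler


def _wrap(middleware, next_handler):
    def wrapped(call):
        return middleware(call, next_handler)
    return wrapped


def tracing(label, events):
    """Example middleware: record an event before and after the wrapped call."""
    def middleware(call, next_handler):
        events.append(f"{label}:pre")      # pre phase
        result = next_handler(call)        # execute phase
        events.append(f"{label}:post")     # post phase
        return result
    return middleware
```

The agent code only ever sees the composed handler, so concerns such as logging or rate limiting can be added or removed by editing the list passed to `build_chain`.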

Sandbox Escape Monitoring

×2

Treat sandbox boundary violations as telemetry; alert on syscalls, network egress, or filesystem writes outside expected scope.

Eval as Contract

×1

Treat the eval suite as the contract the agent must satisfy; releases ship only if evals pass.

Sampled Prompt Trace Eval

×1

Capture full prompt/response/metadata traces from production into a monitoring dataset, but only run LLM-judge evaluation on a random sample so monitoring cost stays bounded as traffic grows.
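
A small sketch of the split between capture and evaluation, assuming traces are dicts, the monitoring dataset is a list, and `judge` is any callable returning a score. All names here are hypothetical.

```python
import random


def record_and_maybe_judge(trace, dataset, judge, sample_rate, rng=random):
    """Always store the trace; run the (expensive) judge only on a random sample."""
    dataset.append(trace)
    # rng.random() is in [0.0, 1.0), so sample_rate=0 never judges
    # and sample_rate=1 always does.
    if rng.random() < sample_rate:
        trace["score"] = judge(trace)
    return trace
```

Storage grows with traffic, but judge cost grows only with `traffic * sample_rate`, which is the bounded-cost property the pattern describes.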

Multi-Principal Welfare Aggregation

×1

When an agent serves multiple humans with conflicting preferences, declare the aggregation rule explicitly rather than letting it be implicit in the prompt or fine-tune.

Prompt Versioning

Treat prompts as immutable, hashed, semver'd artefacts in a registry; deploy and roll back like code.
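
A minimal in-memory sketch of such a registry, assuming prompts are keyed by `(name, version)` and hashed with SHA-256. The class `PromptRegistry` is hypothetical; a production registry would persist entries and validate the version string.

```python
import hashlib


class PromptRegistry:
    """Store prompts under (name, version); published entries cannot be overwritten."""

    def __init__(self):
        self._entries = {}

    def publish(self, name, version, text):
        key = (name, version)
        if key in self._entries:
            # Immutability: changing a prompt requires a new version number.
            raise ValueError(f"{name}@{version} is already published")
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        self._entries[key] = {"text": text, "sha256": digest}
        return digest

    def get(self, name, version):
        return self._entries[(name, version)]
```

Rolling back then means pointing the deployment at an earlier version string, and the stored hash lets you verify the text served is the text that was reviewed.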

Agent Evaluator

A dedicated agent or harness whose sole job is running tests against another agent's outputs to evaluate performance; distinct from eval-harness (offline batch) and llm-as-judge (per-output).

Agent Factory

Manufacture agent instances from a versioned template that renders model, tools, and prompt atomically, with registry-backed identities, so a fleet stays consistent and one template change propagates…

Agentic Golden Path

Constrain an agent to the platform's curated golden path of living, machine-readable standards and check for drift as it works, so its output is compliant by construction rather than corrected later.

Bayesian Bandit Experimentation

Replace fixed-split A/B tests between agent variants with a bandit that dynamically reallocates traffic toward better-performing variants based on observed reward, bounding regret from bad variants.
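
One common way to implement this is Thompson sampling with a Beta posterior per variant, sketched below under the assumption that reward is binary (success or failure). The source does not name a specific algorithm, so treat this as one option; `ThompsonBandit` is an illustrative name.

```python
import random


class ThompsonBandit:
    """Route traffic between variants using Beta-Bernoulli Thompson sampling."""

    def __init__(self, variants, rng=None):
        self.rng = rng or random.Random()
        # [alpha, beta] per variant, starting from a uniform Beta(1, 1) prior.
        self.stats = {v: [1, 1] for v in variants}

    def choose(self):
        """Sample a plausible success rate for each variant and pick the highest."""
        return max(
            self.stats,
            key=lambda v: self.rng.betavariate(self.stats[v][0], self.stats[v][1]),
        )

    def update(self, variant, success):
        """Fold one observed outcome into the variant's posterior."""
        self.stats[variant][0 if success else 1] += 1
```

As evidence accumulates, a poorly performing variant is sampled less and less often, which is how the bandit bounds regret compared with a fixed 50/50 split.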

Dual Evaluation (Offline + Online)

Run two parallel evaluation tracks — offline benchmark gates before deploy AND online production-traffic monitoring after — so drift is caught even when pre-deploy benchmarks pass.

Intermediate Artifact Evaluation

Evaluate intermediate artifacts (plans, tool-call traces, guardrail reactions) not only final outputs; isolates failure to a specific pipeline node.

Own Your Prompts (12-Factor Agents)

Every prompt in a production agent is versioned, tested, and owned by the team in the application repo — never inherited as a framework default.

Rigor Relocation

Relocate verification rigor from the model loop to surrounding scaffolding (evals, judges, decision logs, policy gates) so failures are caught by the wrapper rather than the agent.

Scorer Live Monitoring

Score agent outputs asynchronously in production with non-blocking scorers that observe, alert, and log but do not regenerate the output.

Attention-Manipulation Explainability

Surface which input tokens caused a given output by perturbing attention across all transformer layers and measuring the resulting change in output probability, producing a per-token relevance map al…