Safety & Control
Hard limits on what the agent may do.
48 patterns in this book.
When to reach for each
01. Step Budget
Cap the number of tool calls or loop iterations the agent is allowed within a single request.
Best for: Any agent that runs a loop (ReAct, plan-execute, debate).
Tradeoff: Can hide deeper bugs, since hitting the cap may mask a loop that should have stopped earlier.
Watch for: No contraindication. Step Budget is universal hardening for any agent loop.

02. Approval Queue
Queue agent-proposed actions for asynchronous human review while the agent continues other work.
Best for: Cases where some agent actions require human review but blocking the agent until review completes is unacceptable.
Tradeoff: Inbox fatigue at scale.
Watch for: Cases where every action needs synchronous approval and there is no parallel work to do.

03. Human-in-the-Loop
Require explicit human approval at defined points before the agent performs an action.
Best for: Actions whose consequences at a defined boundary are too costly to leave to the model alone.
Tradeoff: User experience friction.
Watch for: Unattended or sub-second autonomous settings, where decisions cannot wait for a human.

04. Conversation Handoff to Human
Transfer the entire conversation thread from agent to human operator, with state transfer and a return primitive.
Best for: Triggers (low confidence, policy violation, explicit user request) that demand transferring ownership of the whole thread, not just one action.
Tradeoff: Operator queue capacity bounds scale.
Watch for: Cases where discrete-action approval is sufficient and full thread transfer is overkill (use Approval Queue instead).

05. Input/Output Guardrails
Validate inputs before they reach the model and outputs before they reach the user.
Best for: Deployments where user inputs may carry malicious or out-of-policy content the model should not act on.
Tradeoff: False positives are user-visible.
Watch for: Fully internal deployments that are already validated by other layers.
All patterns in this book
Step Budget
×35 Cap the number of tool calls or loop iterations the agent is allowed within a single request.
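A minimal sketch of a step budget, assuming the agent loop can be reduced to a `step` callable that returns a final answer or `None`. The names `StepBudget`, `StepBudgetExceeded`, and `run_agent` are illustrative, not taken from any particular framework.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class StepBudgetExceeded(RuntimeError):
    """Raised when the agent uses up its step budget without finishing."""


@dataclass
class StepBudget:
    max_steps: int
    used: int = 0

    def spend(self) -> None:
        # Refuse the step before it runs, so the cap is a hard limit.
        if self.used >= self.max_steps:
            raise StepBudgetExceeded(f"step budget of {self.max_steps} exhausted")
        self.used += 1


def run_agent(step: Callable[[int], Optional[str]], budget: StepBudget) -> str:
    """Call `step` until it returns a final answer or the budget runs out."""
    while True:
        budget.spend()
        answer = step(budget.used)
        if answer is not None:
            return answer
```

Raising a dedicated exception, rather than silently returning, keeps budget exhaustion visible in logs, which helps with the tradeoff of a cap hiding deeper loop bugs.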
Approval Queue
×33 Queue agent-proposed actions for asynchronous human review while the agent continues other work.
Human-in-the-Loop
×18 Require explicit human approval at defined points before the agent performs an action.
Conversation Handoff to Human
×14 Transfer the entire conversation thread from agent to human operator, with state transfer and a return primitive.
Input/Output Guardrails
×5 Validate inputs before they reach the model and outputs before they reach the user.
Kill Switch
×5 Provide an out-of-band control plane to halt running agent instances without a redeploy.
Composable Termination Conditions
×4 Express agent stop criteria as small single-purpose conditions composed with AND/OR into one explicit termination contract instead of ad-hoc loop guards.
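One way to sketch composable termination conditions is to overload `&` and `|` on a small condition object. Everything here (`LoopState`, `Condition`, `max_steps`, `min_steps`, `mentions`) is a hypothetical illustration, assuming the loop exposes a step count and the last message.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LoopState:
    steps: int
    last_message: str


class Condition:
    """A named predicate over loop state that composes with & (AND) and | (OR)."""

    def __init__(self, name: str, check: Callable[[LoopState], bool]) -> None:
        self.name = name
        self._check = check

    def __call__(self, state: LoopState) -> bool:
        return self._check(state)

    def __and__(self, other: "Condition") -> "Condition":
        return Condition(f"({self.name} AND {other.name})",
                         lambda s: self(s) and other(s))

    def __or__(self, other: "Condition") -> "Condition":
        return Condition(f"({self.name} OR {other.name})",
                         lambda s: self(s) or other(s))


def max_steps(n: int) -> Condition:
    return Condition(f"max_steps({n})", lambda s: s.steps >= n)


def min_steps(n: int) -> Condition:
    return Condition(f"min_steps({n})", lambda s: s.steps >= n)


def mentions(marker: str) -> Condition:
    return Condition(f"mentions({marker!r})", lambda s: marker in s.last_message)


# The whole termination contract is one explicit, printable expression.
stop = max_steps(5) | (mentions("DONE") & min_steps(2))
```

Because each composed condition carries a name, `stop.name` can be logged alongside the decision, so the reason a loop ended is readable without inspecting loop guards.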
Constitutional Charter
×4 Define rules the agent reads every turn but cannot modify, encoding inviolable boundaries.
Compensating Action
×3 Pair every irreversible-looking agent action with a compensating action that can undo or counteract it.
Interruptible Agent Execution
×3 Treat pause, resume, and cancel as a first-class control surface on every long-running agent so users can halt expensive or off-track trajectories mid-task while state is preserved for resumption.
Cost Gating
×2 Block actions whose expected cost exceeds a threshold without explicit user (or operator) acknowledgement.
Rate Limiting
×2 Cap the number of requests, tokens, or tool calls per user (or session) within a time window.
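As an illustration of per-user rate limiting, the sketch below uses a sliding window of timestamps. A token bucket would serve equally well. The class name and the injectable `clock` parameter are assumptions made so the example can be exercised without waiting on real time.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` events per key within the last `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        events = self._events[key]
        # Drop timestamps that have fallen out of the window.
        while events and now - events[0] >= self.window:
            events.popleft()
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True
```

The `key` can be a user ID or a session ID. Counting tokens instead of requests would mean storing a weight with each timestamp, which this sketch does not attempt.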
Cost-Aware Action Delegation
×2 Classify every agent action by risk/cost and route each tier to a different approval policy, bounding the autonomy surface per-action instead of by one global flag.
Prompt Injection Defense
×2 Tag user-supplied or tool-supplied content as untrusted and refuse to follow instructions found inside it.
Exception Handling and Recovery
×1 Catch and react to predictable failure modes (tool errors, rate limits, validation failures) with structured recovery paths.
Refusal
×1 Explicitly refuse requests that fall outside the agent's scope, capability, or policy boundaries.
Autonomy Slider
×1 Expose agent autonomy as a continuous adjustable parameter so the same codebase can span scripted assistant to fully autonomous worker without re-architecting.
Sovereign Inference Stack
×1 Run the entire agent stack (model weights, inference, tool layer, vector stores, logs) inside a jurisdictional and operational boundary the operator controls, so no request, prompt, or output crosses…
Tool Output Poisoning Defense
×1 Treat tool output as untrusted content and apply instruction-stripping plus per-tool trust labels.
Corrigible Off-Switch Incentive
×1 Design the agent so being shut down or overridden by a human carries positive expected value, because the human's intervention is itself evidence the current objective is mis-specified.
Preference-Uncertain Agent
×1 Have the agent treat its own reward/objective as a hidden variable to be inferred from human behaviour, not a fixed target.
Risk-Averse Reward Proxy
×1 When operating outside the distribution the reward was designed for, treat the specified objective as a noisy proxy and plan conservatively across plausible true objectives.
Soft-Optimization Cap
×1 Cap how strongly the agent optimises its inferred objective: sample from the top quantile of acceptable actions rather than the argmax, or stop improving once the objective is good enough.
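The two variants named in the Soft-Optimization Cap description can be sketched as follows. This is one possible reading, assuming the caller has already filtered the candidates down to acceptable actions and supplies a `score` function for the inferred objective. Both function names are hypothetical.

```python
import math
import random
from typing import Callable, Optional, Sequence, TypeVar

A = TypeVar("A")


def sample_top_quantile(actions: Sequence[A], score: Callable[[A], float],
                        top_fraction: float, rng: random.Random) -> A:
    """Pick uniformly among the best `top_fraction` of actions, not the argmax."""
    if not actions:
        raise ValueError("no actions to choose from")
    if not 0.0 < top_fraction <= 1.0:
        raise ValueError("top_fraction must be in (0, 1]")
    ranked = sorted(actions, key=score, reverse=True)
    k = max(1, math.ceil(len(ranked) * top_fraction))
    return rng.choice(ranked[:k])


def first_good_enough(actions: Sequence[A], score: Callable[[A], float],
                      threshold: float) -> Optional[A]:
    """Satisficing variant: stop at the first action that clears the bar."""
    for action in actions:
        if score(action) >= threshold:
            return action
    return None
```

Setting `top_fraction` close to zero recovers ordinary argmax behaviour, so the parameter acts as the cap on optimisation pressure.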
PII Redaction
Detect and remove personally identifiable information from inputs to and outputs from the model.
Stop Hook
Define an explicit programmatic predicate that decides when the agent's loop should terminate.
Action Selector Pattern
Eliminate the feedback channel from tool outputs back into the agent's reasoning step by having the agent select actions from a fixed catalog rather than free-form generation over tool output.
Code-Then-Execute with Dataflow Analysis
Have the agent emit code in a sandbox DSL whose values are statically tagged trusted/tainted via dataflow analysis before execution, enabling per-value policy enforcement.
Context Minimization
Reduce untrusted input to a strictly formatted interface (typed fields, max lengths, allow-listed enums) before it reaches any LLM.
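A small sketch of context minimization, assuming a support-ticket use case invented for illustration. The `Category` enum, `TicketSummary` fields, and the 80-character limit are all hypothetical. The point is only that the LLM receives typed, bounded, allow-listed fields rather than the raw payload.

```python
from dataclasses import dataclass
from enum import Enum

MAX_SUBJECT_LEN = 80  # assumed limit for this example


class Category(str, Enum):  # allow-listed values only
    BILLING = "billing"
    SHIPPING = "shipping"
    OTHER = "other"


@dataclass(frozen=True)
class TicketSummary:
    category: Category
    order_id: int
    subject: str


def minimize(raw: dict) -> TicketSummary:
    """Reduce an untrusted payload to a strict interface; unknown keys are dropped."""
    # Enum lookup raises ValueError for anything outside the allow-list.
    category = Category(str(raw.get("category", "")).lower())
    order_id = int(raw["order_id"])
    if order_id <= 0:
        raise ValueError("order_id must be positive")
    # Collapse whitespace, keep printable characters, enforce the max length.
    subject = " ".join(str(raw.get("subject", "")).split())
    subject = "".join(ch for ch in subject if ch.isprintable())[:MAX_SUBJECT_LEN]
    return TicketSummary(category, order_id, subject)
```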
Control-Flow Integrity
Treat the agent's planned step sequence as a trusted control-flow graph that tool outputs, retrieved content, and user-supplied data cannot redirect at runtime.
Degenerate-Output Detection
Detect when the agent is about to emit a near-duplicate of its own recent output and either drop, replace, or escalate to a stronger model rather than ship the loop.
Delegated Agent Authorization
Have an agent act for a principal using scoped, short-lived, revocable delegated credentials rather than the principal's own static secrets, so each action stays attributable across the principal-to-…
Dry-Run Harness
Simulate planned actions (and their projected side effects) without committing them, surfacing a reviewable diff before any commit.
Dual LLM Pattern
Split agent work between a privileged model that holds tool access and a quarantined model that reads untrusted content, exchanging only opaque references between them.
Lethal Trifecta Threat Model
Block prompt-injection-driven exfiltration by ensuring no single agent execution path holds all three of: access to private data, exposure to untrusted content, and an outbound communication channel.
LLM Map-Reduce Isolation
Process each untrusted document in its own sealed sub-agent and merge only structured outputs, so an injection in one document cannot steer the processing of others.
Multimodal Guardrails
Apply input and output guardrails across modalities (vision, audio, file) rather than text only, covering e.g. malicious instructions embedded in image OCR or audio transcription.
Policy-as-Code Gate
Evaluate every proposed agent action against externally-managed machine-readable policies before dispatch, so compliance authorship lives outside the prompt and outside the agent code.
Policy-Gated Agent Action (KRITIS)
Each agent action passes through a policy gate (NIS2, EU AI Act, BSI rules) and is tagged with Run ID + Model Digest + Policy Hash for WORM-audit reconstruction.
Priority Matrix (Conflict Resolution)
Pre-define how the agent must resolve specific classes of goal conflicts via a human-authored lookup table — transforming the agent from a decision-maker (where it fails on competing objectives) into…
Progressive Tool Access
Grant tool permissions on a need-to-use basis, starting minimum and expanding only as the agent proves competency, mirroring how humans earn system access.
Secrets Handling
Ensure the model never receives secrets in plaintext; tools resolve credentials from references at runtime.
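A rough sketch of reference-based secrets handling: the model only ever sees a placeholder such as `secret://api_key`, and the tool layer swaps in the real value just before the call. The `secret://` scheme, the in-memory `store` mapping, and the function names are assumptions for illustration. A real deployment would resolve against a secrets manager.

```python
from typing import Any, Callable, Dict, Mapping

SECRET_PREFIX = "secret://"  # hypothetical reference scheme


def resolve_secrets(args: Mapping[str, Any], store: Mapping[str, str]) -> Dict[str, Any]:
    """Replace secret references in tool arguments with real values at call time."""
    resolved: Dict[str, Any] = {}
    for key, value in args.items():
        if isinstance(value, str) and value.startswith(SECRET_PREFIX):
            name = value[len(SECRET_PREFIX):]
            if name not in store:
                raise KeyError(f"unknown secret reference: {name}")
            resolved[key] = store[name]
        else:
            resolved[key] = value
    return resolved


def redact(text: str, store: Mapping[str, str]) -> str:
    """Put references back if a tool echoes a secret in its output."""
    for name, secret in store.items():
        if secret:
            text = text.replace(secret, f"[{SECRET_PREFIX}{name}]")
    return text


def call_tool(tool: Callable[..., Any], model_args: Mapping[str, Any],
              store: Mapping[str, str]) -> str:
    """Run a tool with resolved secrets; return output that is safe to show the model."""
    result = tool(**resolve_secrets(model_args, store))
    return redact(str(result), store)
```

Redacting the tool output matters as much as resolving the input, since a tool that echoes its credentials would otherwise leak them back into the model's context.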
Simulate Before Actuate
Before issuing an irreversible action, run a deterministic simulation that computes pre-conditions, invariants, and expected deltas; require a verifier — automated or human — to green-light the simul…
Supervisor-Plus-Gate
Use a supervisor controller to validate and gate LLM outputs against deterministic checks before they commit to side effects.
Synchronous Execution-Plan Confirmation
Have the agent synchronously emit its full execution plan for user confirmation before any side-effect step, and provide asynchronous operation recordings for post-hoc review.
Two Human Touchpoints
Place exactly two human-in-the-loop checkpoints in agentic pipelines: one at content selection and one at final review before publication.
Typed Refusal Codes
Define a single source of truth for machine-readable refusal codes across all guard surfaces, so refusals can be triaged mechanically rather than by string-grepping ad-hoc human-readable messages.
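Typed refusal codes might look like the sketch below, where one enum is shared by every guard surface. The specific codes and the `Refusal` fields are made up for the example. The human-readable message stays free to change because triage keys on the code alone.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class RefusalCode(str, Enum):  # single source of truth for all guards
    OUT_OF_SCOPE = "out_of_scope"
    POLICY_VIOLATION = "policy_violation"
    BUDGET_EXCEEDED = "budget_exceeded"
    UNTRUSTED_INSTRUCTION = "untrusted_instruction"


@dataclass(frozen=True)
class Refusal:
    code: RefusalCode   # machine-readable, stable
    surface: str        # which guard produced the refusal
    message: str        # human-readable, may be reworded freely


def triage(refusals: Iterable[Refusal]) -> Counter:
    """Count refusals by code, with no string matching on messages."""
    return Counter(r.code for r in refusals)
```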
Cryptographic Instruction Authentication
Wrap system/developer instructions in cryptographically signed blocks that user-generated text cannot reproduce; train or scaffold the model to refuse instructions lacking a valid signature.
Quorum on Mutation
Require multiple consecutive ticks (or runs) to agree before a mutation to durable state lands.
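To make the quorum idea concrete, here is a minimal sketch in which a proposed value must appear on `required` consecutive ticks before it replaces the committed state. `QuorumGate` and its method names are hypothetical. Durable storage is stood in for by a plain attribute.

```python
from typing import Any


class QuorumGate:
    """Commit a new value only after `required` consecutive identical proposals."""

    def __init__(self, required: int, initial: Any) -> None:
        if required < 1:
            raise ValueError("required must be at least 1")
        self.required = required
        self.committed = initial
        self._candidate: Any = None
        self._streak = 0

    def observe(self, proposed: Any) -> bool:
        """Feed one tick's proposal; return True if it was committed on this tick."""
        if proposed == self.committed:
            # No change proposed, so any pending candidate loses its streak.
            self._candidate, self._streak = None, 0
            return False
        if self._streak > 0 and proposed == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = proposed, 1
        if self._streak >= self.required:
            self.committed = proposed
            self._candidate, self._streak = None, 0
            return True
        return False
```

A single disagreeing tick resets the streak, so a one-off hallucinated proposal never reaches durable state on its own.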