← All booksBook VIII

Safety & Control

Hard limits on what the agent may do.

48 patterns in this book. · Updated

↓ download as png

When to reach for each

01. Step Budget Cap the number of tool calls or loop iterations the agent is allowed within a single request. Best for: The agent has any kind of loop (ReAct, plan-execute, debate). Tradeoff: Can hide deeper bugs (the agent really should stop earlier). Watch for: Never. Step Budget is universal hardening for any agent loop.

02. Approval Queue Queue agent-proposed actions for asynchronous human review while the agent continues other work. Best for: Some agent actions require human review but blocking the agent until review completes is unacceptable. Tradeoff: Inbox fatigue at scale. Watch for: Every action needs synchronous approval and there is no parallel work to do.

03. Human-in-the-Loop Require explicit human approval at defined points before the agent performs an action. Best for: Action consequences at a defined boundary are too costly to leave to the model alone. Tradeoff: User experience friction. Watch for: Decisions must be made in unattended or sub-second autonomous settings.

04. Conversation Handoff to Human Transfer the entire conversation thread from agent to human operator, with state transfer and return primitive. Best for: Some triggers (low confidence, policy violation, explicit user request) demand transferring ownership of the whole thread, not just one action. Tradeoff: Operator queue capacity bounds scale. Watch for: Discrete-action approval is sufficient and full thread transfer is overkill (use approval-queue).

05. Input/Output Guardrails Validate inputs before they reach the model and outputs before they reach the user. Best for: User inputs may carry malicious or out-of-policy content the model should not act on. Tradeoff: False positives are user-visible. Watch for: The deployment is fully internal and validated by other layers already.

All patterns in this book

Step Budget

×35

Cap the number of tool calls or loop iterations the agent is allowed within a single request.

Approval Queue

×33

Queue agent-proposed actions for asynchronous human review while the agent continues other work.

Human-in-the-Loop

×18

Require explicit human approval at defined points before the agent performs an action.

Conversation Handoff to Human

×14

Transfer the entire conversation thread from agent to human operator, with state transfer and return primitive.

Input/Output Guardrails

×5

Validate inputs before they reach the model and outputs before they reach the user.

Kill Switch

×5

Provide an out-of-band control plane to halt running agent instances without redeploy.

Composable Termination Conditions

×4

Express agent stop criteria as small single-purpose conditions composed with AND/OR into one explicit termination contract instead of ad-hoc loop guards.

Constitutional Charter

×4

Define rules the agent reads every turn but cannot modify, encoding inviolable boundaries.

Compensating Action

×3

Pair every irreversible-looking agent action with a compensating action that can undo or counteract it.

Interruptible Agent Execution

×3

Treat pause, resume, and cancel as a first-class control surface on every long-running agent so users can halt expensive or off-track trajectories mid-task while state is preserved for resumption.

Cost Gating

×2

Block actions whose expected cost exceeds a threshold without explicit user (or operator) acknowledgement.

Rate Limiting

×2

Cap the number of requests, tokens, or tool calls per user (or session) within a time window.

Cost-Aware Action Delegation

×2

Classify every agent action by risk/cost and route each tier to a different approval policy, bounding the autonomy surface per-action instead of by one global flag.

Prompt Injection Defense

×2

Tag user-supplied or tool-supplied content as untrusted and refuse to follow instructions found inside it.

Exception Handling and Recovery

×1

Catch and react to predictable failure modes (tool errors, rate limits, validation failures) with structured recovery paths.

Refusal

×1

Explicitly refuse requests that fall outside the agent's scope, capability, or policy boundaries.

Autonomy Slider

×1

Expose agent autonomy as a continuous adjustable parameter so the same codebase can span scripted assistant to fully autonomous worker without re-architecting.

Sovereign Inference Stack

×1

Run the entire agent stack (model weights, inference, tool layer, vector stores, logs) inside a jurisdictional and operational boundary the operator controls, so no request, prompt, or output crosses…

Corrigible Off-Switch Incentive

×1

Design the agent so being shut down or overridden by a human carries positive expected value, because the human's intervention is itself evidence the current objective is mis-specified.

Preference-Uncertain Agent

×1

Agent treats its own reward/objective as a hidden variable to be inferred from human behaviour, not a fixed target.

Risk-Averse Reward Proxy

×1

When operating outside the distribution the reward was designed for, treat the specified objective as a noisy proxy and plan conservatively across plausible true objectives.

Soft-Optimization Cap

×1

Cap how strongly the agent optimises its inferred objective — sample from the top quantile of acceptable actions rather than the argmax, or stop improving once the objective is good enough.

PII Redaction

Detect and remove personally identifiable information from inputs to and outputs from the model.

Stop Hook

Define an explicit programmatic predicate that decides when the agent's loop should terminate.

Action Selector Pattern

Eliminate the feedback channel from tool outputs back into the agent's reasoning step by having the agent select actions from a fixed catalog rather than free-form generation over tool output.

Code-Then-Execute with Dataflow Analysis

Have the agent emit code in a sandbox DSL whose values are statically tagged trusted/tainted via dataflow analysis before execution, enabling per-value policy enforcement.

Context Minimization

Reduce untrusted input to a strictly formatted interface (typed fields, max lengths, allow-listed enums) before it reaches any LLM.

Control-Flow Integrity

Treat the agent's planned step sequence as a trusted control-flow graph that tool outputs, retrieved content, and user-supplied data cannot redirect at runtime.

Degenerate-Output Detection

Detect when the agent is about to emit a near-duplicate of its own recent output and either drop, replace, or escalate to a stronger model rather than ship the loop.

Delegated Agent Authorization

Have an agent act for a principal using scoped, short-lived, revocable delegated credentials rather than the principal's own static secrets, so each action stays attributable across the principal-to-…

Dry-Run Harness

Simulate planned actions (and their projected side effects) without committing them, surfacing a reviewable diff before any commit.

Dual LLM Pattern

Split agent work between a privileged model that holds tool access and a quarantined model that reads untrusted content, exchanging only opaque references between them.

Lethal Trifecta Threat Model

Block prompt-injection-driven exfiltration by ensuring no single agent execution path holds all three of: access to private data, exposure to untrusted content, and an outbound communication channel.

LLM Map-Reduce Isolation

Process each untrusted document in its own sealed sub-agent and merge only structured outputs, so an injection in one document cannot steer the processing of others.

Multimodal Guardrails

Input and output guardrails that operate across modalities (vision, audio, file) rather than text only — handling e.g. malicious instructions embedded in image OCR or audio transcription.

Policy-as-Code Gate

Evaluate every proposed agent action against externally-managed machine-readable policies before dispatch, so compliance authorship lives outside the prompt and outside the agent code.

Policy-Gated Agent Action (KRITIS)

Each agent action passes through a policy gate (NIS2, EU the agent Act, BSI rules) and is tagged with Run ID + Model Digest + Policy Hash for WORM-audit reconstruction.

Priority Matrix (Conflict Resolution)

Pre-define how the agent must resolve specific classes of goal conflicts via a human-authored lookup table — transforming the agent from a decision-maker (where it fails on competing objectives) into…

Progressive Tool Access

Grant tool permissions on a need-to-use basis, starting minimum and expanding only as the agent proves competency, mirroring how humans earn system access.

Secrets Handling

Ensure the model never receives secrets in plaintext; tools resolve credentials from references at runtime.

Simulate Before Actuate

Before issuing an irreversible action, run a deterministic simulation that computes pre-conditions, invariants, and expected deltas; require a verifier — automated or human — to green-light the simul…

Supervisor-Plus-Gate

Supervisor controller that validates and gates LLM outputs against deterministic checks before they commit to side-effects.

Synchronous Execution-Plan Confirmation

Agent synchronously emits its full execution plan for user confirmation before any side-effect step, and provides asynchronous operation recordings for post-hoc review.

Two Human Touchpoints

Place exactly two human-in-the-loop checkpoints in agentic pipelines: one at content selection and one at final review before publication.

Typed Refusal Codes

Define a single source of truth for machine-readable refusal codes across all guard surfaces, so refusals can be triaged mechanically rather than by string-grepping ad-hoc human-readable messages.

Cryptographic Instruction Authentication

Wrap system/developer instructions in cryptographically signed blocks that user-generated text cannot reproduce; train or scaffold the model to refuse instructions lacking a valid signature.

Quorum on Mutation

Require multiple consecutive ticks (or runs) to agree before a mutation to durable state lands.