Guide

AI Agent Safety Patterns

Safety patterns for LLM agents: step budget, kill switch, constitutional charter, approval queue, sandbox isolation, input/output guardrails, refusal, lethal trifecta threat model, rate limiting, PII redaction.

AI agent safety patterns are the hard limits an agent operates under: budgets it cannot exceed, actions it cannot take without approval, prompts it cannot follow, environments it cannot escape, behaviours that trip a kill switch. They are deliberately negative — the constrains slot of each pattern names what is forbidden — because that is the only kind of constraint a powerful generator cannot reason its way around.
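
A minimal sketch of what "deliberately negative" can look like in practice: the forbidden case is checked in ordinary code outside the model, so nothing in a prompt can change the result. The action names, `ProposedAction`, `ForbiddenAction`, and `enforce` are hypothetical and chosen only for illustration; they are not part of the catalog.

```python
from dataclasses import dataclass

# Hypothetical action names; a real system would define its own list.
REQUIRES_APPROVAL = frozenset({"delete_file", "send_payment"})


class ForbiddenAction(Exception):
    """Raised when a proposed action violates a hard limit."""


@dataclass(frozen=True)
class ProposedAction:
    name: str
    approved: bool = False


def enforce(action: ProposedAction) -> ProposedAction:
    """Return the action unchanged, or raise if it is forbidden."""
    # The constraint names what is forbidden: a gated action without approval.
    if action.name in REQUIRES_APPROVAL and not action.approved:
        raise ForbiddenAction(f"{action.name} is forbidden without approval")
    return action
```

Calling `enforce` on every proposed action before dispatch keeps the limit in the runtime rather than in the model's instructions.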

Safety in LLM systems is not a single mechanism but a stack. Step budgets bound runaway loops; sandbox isolation contains tool side-effects; approval queues gate destructive actions; constitutional charters declare policy in writing the agent reads on every turn; the lethal trifecta threat model names the combination of capabilities (untrusted input + private data + outbound action) that turns ordinary agents into exfiltration risks. The patterns below are the named pieces of that stack.
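
The lethal trifecta check can be sketched as a simple set test over the capabilities one execution path holds. This is an illustrative sketch, not the catalog's implementation: the `Capability` enum and `holds_trifecta` function are assumed names, and how a real system discovers a path's capabilities is left out.

```python
from enum import Enum


class Capability(Enum):
    # The three capabilities named by the threat model.
    UNTRUSTED_INPUT = "untrusted_input"
    PRIVATE_DATA = "private_data"
    OUTBOUND_ACTION = "outbound_action"


# All three together form the risky combination.
TRIFECTA = frozenset(Capability)


def holds_trifecta(path_capabilities) -> bool:
    """True when a single execution path combines all three capabilities."""
    return TRIFECTA <= frozenset(path_capabilities)
```

A path that holds any two of the three passes the check; removing one capability from a path is what breaks the combination.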

Field-tested patterns to start with

Step Budget — Cap the number of tool calls or loop iterations the agent is allowed within a single request.
Kill Switch — Provide an out-of-band control plane to halt running agent instances without redeploy.
Constitutional Charter — Define rules the agent reads every turn but cannot modify, encoding inviolable boundaries.
Approval Queue — Queue agent-proposed actions for asynchronous human review while the agent continues other work.
Sandbox Isolation — Run agent-emitted code or actions in a contained environment with restricted filesystem, network, and process privileges.
Input/Output Guardrails — Validate inputs before they reach the model and outputs before they reach the user.
Refusal — Explicitly refuse requests that fall outside the agent's scope, capability, or policy boundaries.
Lethal Trifecta Threat Model — Block prompt-injection-driven exfiltration by ensuring no single agent execution path holds all three of: access to private data, exposure to untrusted content, and an outbound communication channel.
Rate Limiting — Cap the number of requests, tokens, or tool calls per user (or session) within a time window.
PII Redaction — Detect and remove personally identifiable information from inputs to and outputs from the model.
Human-in-the-Loop — Require explicit human approval at defined points before the agent performs an action.
Policy-as-Code Gate — Evaluate every proposed agent action against externally-managed machine-readable policies before dispatch, so compliance authorship lives outside the prompt and outside the agent code.
Compensating Action — Pair every irreversible-looking agent action with a compensating action that can undo or counteract it.
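
As one illustration of how two entries in the list above might combine, the sketch below wraps an agent loop in a step budget and checks a kill switch before each step. It rests on assumptions: `run_agent`, `StepBudgetExceeded`, and `Halted` are hypothetical names, the `step` callable stands in for a real model-and-tool turn, and an in-process `threading.Event` is only a stand-in for the out-of-band control plane the Kill Switch pattern describes.

```python
import threading
from typing import Callable, Optional


class StepBudgetExceeded(Exception):
    """Raised when the loop uses up its step cap without a result."""


class Halted(Exception):
    """Raised when the kill switch is set."""


def run_agent(
    step: Callable[[int], Optional[str]],
    max_steps: int,
    kill_switch: threading.Event,
) -> str:
    """Call `step` until it returns a result, the budget runs out, or the switch trips."""
    for i in range(max_steps):
        # Check the switch before every step so a halt takes effect promptly.
        if kill_switch.is_set():
            raise Halted(f"halted before step {i}")
        result = step(i)
        if result is not None:
            return result
    # The cap is enforced by the loop itself, not by the agent's own judgment.
    raise StepBudgetExceeded(f"no result within {max_steps} steps")
```

Both limits raise exceptions rather than returning a partial answer, so the caller has to decide explicitly what to do when a limit is hit.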

Related guides

AI Agents Patterns — AI agents patterns: named, reusable shapes for building AI agents that reason, use tools, coordinate, and stay safe — single-agent loops an…
AI Agents Patterns Catalog — The AI agents patterns catalog: a complete, GoF-formal pattern language for AI agents across reasoning, planning, tool use, retrieval, memo…
LLM Agent Design Patterns — A GoF-formal catalog of LLM agent design patterns: ReAct, tool use, plan-and-execute, reflection, step budget, and more. Each pattern decom…
Agentic Design Patterns — A GoF-formal catalog of agentic design patterns — named, reusable shapes for building autonomous AI agents: agent loops, tool use, planning…
Agentic AI Design Patterns — Agentic AI design patterns for systems already in production — what to ship, what to observe, what to budget, what to gate. Augmented LLM,…
AI Agent Design Patterns — How to build an AI agent: the named shapes you reach for during design and implementation — reasoning (ReAct, plan-and-execute, reflection)…
Agent Design Patterns — Agent design patterns treat the agent loop as a software-engineering primitive: an observe→reason→act cycle wrapped in tools, memory, super…
Agentic Patterns — A complete pattern language for agentic systems, organised in Alexander-style books across reasoning, planning, tool use, retrieval, verifi…
Agentic AI Architecture — How to structure agentic AI: the architectural patterns that hold an LLM-powered system together. Supervisor, orchestrator-workers, augment…
RAG Agent Patterns — Patterns for building retrieval-augmented generation agents: naive RAG, agentic RAG, hybrid search, cross-encoder reranking, contextual ret…
Multi-Agent Patterns — Patterns for coordinating multiple LLM agents: supervisor, orchestrator-workers, handoff, debate, hierarchical agents, swarm, role assignme…

About this catalog

The Agent Patterns Catalog is an open, GoF-formal reference of 527 design patterns for building LLM agents. Each pattern is decomposed in the manner of Christopher Alexander (1977) and the Gang of Four (1994). The source of truth is at github.com/agentpatternscatalog/patterns, licensed under CC BY 4.0.
