AI Agent Safety Patterns
Safety patterns for LLM agents: step budget, kill switch, constitutional charter, approval queue, sandbox isolation, input/output guardrails, refusal, lethal trifecta threat model, rate limiting, PII redaction.
AI agent safety patterns are the hard limits an agent operates under: budgets it cannot exceed, actions it cannot take without approval, prompts it cannot follow, environments it cannot escape, behaviours that trip a kill switch. They are deliberately negative — the constrains slot of each pattern names what is forbidden — because that is the only kind of constraint a powerful generator cannot reason its way around.
Safety in LLM systems is not a single mechanism but a stack. Step budgets bound runaway loops; sandbox isolation contains tool side-effects; approval queues gate destructive actions; constitutional charters declare policy in writing that the agent reads on every turn; the lethal trifecta threat model names the combination of capabilities (untrusted input + private data + outbound action) that turns an ordinary agent into an exfiltration risk. The patterns below are the named pieces of that stack.
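The lethal trifecta lends itself to a small static check. The following is a minimal sketch, not the catalog's reference implementation: the `Capabilities` type, its field names, and `has_lethal_trifecta` are hypothetical names chosen for illustration. It assumes each agent execution path can be described by three booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical description of what one agent execution path can touch."""
    reads_untrusted_input: bool   # e.g. web pages, inbound email, user uploads
    reads_private_data: bool      # e.g. internal documents, credentials, user records
    can_act_outbound: bool        # e.g. HTTP requests, sending email, posting messages


def has_lethal_trifecta(caps: Capabilities) -> bool:
    """Return True only when all three risky capabilities are held together."""
    return (
        caps.reads_untrusted_input
        and caps.reads_private_data
        and caps.can_act_outbound
    )


# An agent that browses the web and reads private files but has no outbound
# channel does not hold the full combination.
browsing_agent = Capabilities(
    reads_untrusted_input=True,
    reads_private_data=True,
    can_act_outbound=False,
)
```

A check like this could run when an agent's tool set is assembled, rejecting configurations where `has_lethal_trifecta` returns `True`. Removing any one of the three capabilities is enough to break the combination.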
Field-tested patterns to start with
- Step Budget — Cap the number of tool calls or loop iterations the agent is allowed within a single request.
- Kill Switch — Provide an out-of-band control plane to halt running agent instances without redeploy.
- Constitutional Charter — Define rules the agent reads every turn but cannot modify, encoding inviolable boundaries.
- Approval Queue — Queue agent-proposed actions for asynchronous human review while the agent continues other work.
- Sandbox Isolation — Run agent-emitted code or actions in a contained environment with restricted filesystem, network, and process privileges.
- Input/Output Guardrails — Validate inputs before they reach the model and outputs before they reach the user.
- Refusal — Explicitly refuse requests that fall outside the agent's scope, capability, or policy boundaries.
- Lethal Trifecta Threat Model — Block prompt-injection-driven exfiltration by ensuring no single agent execution path holds all three of: access to private data, exposure to untrusted content, and an outbound communication channel.
- Rate Limiting — Cap the number of requests, tokens, or tool calls per user (or session) within a time window.
- PII Redaction — Detect and remove personally identifiable information from inputs to and outputs from the model.
- Human-in-the-Loop — Require explicit human approval at defined points before the agent performs an action.
- Policy-as-Code Gate — Evaluate every proposed agent action against externally-managed machine-readable policies before dispatch, so compliance authorship lives outside the prompt and outside the agent code.
- Compensating Action — Pair every irreversible-looking agent action with a compensating action that can undo or counteract it.
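Several of the patterns above compose naturally at the point where an agent's proposed action is dispatched. The sketch below combines a step budget, a kill switch, and an approval queue in one guard. It is an illustration under stated assumptions: `GuardedRunner`, `dispatch`, the `destructive` flag, and the exception names are all hypothetical, and a real kill switch would live in an out-of-band control plane, not in an in-process `threading.Event`.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, List


class StepBudgetExceeded(RuntimeError):
    """Raised when the agent has used up its allowed steps for this request."""


class Halted(RuntimeError):
    """Raised when the kill switch has been set."""


@dataclass
class GuardedRunner:
    max_steps: int                                   # step budget for one request
    kill_switch: threading.Event = field(default_factory=threading.Event)
    pending_approval: List[str] = field(default_factory=list)
    steps_used: int = 0

    def dispatch(self, action: str, destructive: bool,
                 execute: Callable[[str], str]) -> str:
        # Kill switch is checked first so a halt takes effect immediately.
        if self.kill_switch.is_set():
            raise Halted("kill switch is set; refusing to dispatch")
        # Step budget: every dispatch attempt counts, including queued ones.
        if self.steps_used >= self.max_steps:
            raise StepBudgetExceeded(f"budget of {self.max_steps} steps exhausted")
        self.steps_used += 1
        # Approval queue: destructive actions are parked for human review
        # instead of being executed.
        if destructive:
            self.pending_approval.append(action)
            return "queued for approval"
        return execute(action)


runner = GuardedRunner(max_steps=2)
first = runner.dispatch("read_file", destructive=False, execute=lambda a: f"ran {a}")
second = runner.dispatch("delete_file", destructive=True, execute=lambda a: f"ran {a}")
```

After these two calls, `first` holds the result of the executed action, `"delete_file"` sits in `runner.pending_approval`, and a third call would raise `StepBudgetExceeded`. The order of the checks is a design choice: checking the kill switch before the budget means an operator halt is reported as a halt even when the budget is also exhausted.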
Recommended reading
- Safety & Control — 48 patterns
- Governance & Observability — 27 patterns
Or open the full contents for all 421 patterns in 14 books.
About this catalog
The Agent Patterns Catalog is an open, GoF-formal reference of 421 design patterns for building LLM agents. Each pattern is decomposed in the manner of Christopher Alexander (1977) and the Gang of Four (1994). Source of truth at github.com/agentpatternscatalog/patterns — CC BY 4.0.