Guide

AI Agent Safety Patterns

Safety patterns for LLM agents: step budget, kill switch, constitutional charter, approval queue, sandbox isolation, input/output guardrails, refusal, lethal trifecta threat model, rate limiting, PII redaction.

AI agent safety patterns are the hard limits an agent operates under: budgets it cannot exceed, actions it cannot take without approval, prompts it cannot follow, environments it cannot escape, behaviours that trip a kill switch. They are deliberately negative — the constrains slot of each pattern names what is forbidden — because that is the only kind of constraint a powerful generator cannot reason its way around.
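As one illustration of a hard limit, here is a minimal sketch of a step budget enforced by the harness that drives the agent rather than by the prompt. The names `StepBudget`, `StepBudgetExceeded`, `run_agent`, and `next_action` are hypothetical and not taken from the catalog; the agent is assumed to be any callable that returns its next action, or `None` when it is finished.

```python
class StepBudgetExceeded(RuntimeError):
    """Raised when the agent tries to take more steps than it was granted."""


class StepBudget:
    """Counts steps outside the model, so the cap does not depend on the prompt."""

    def __init__(self, max_steps: int) -> None:
        if max_steps < 1:
            raise ValueError("max_steps must be at least 1")
        self.max_steps = max_steps
        self.used = 0

    def spend(self) -> None:
        # Check before incrementing so `used` never exceeds `max_steps`.
        if self.used >= self.max_steps:
            raise StepBudgetExceeded(f"step budget of {self.max_steps} exhausted")
        self.used += 1


def run_agent(next_action, budget: StepBudget) -> list:
    """Drive a hypothetical agent loop until it returns None or the budget trips."""
    history = []
    while True:
        budget.spend()  # the limit is charged before every model step
        action = next_action(history)
        if action is None:
            return history
        history.append(action)
```

Because the counter lives in the loop and not in the model's context, nothing the model generates can raise the cap; a runaway agent simply hits `StepBudgetExceeded`.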

Safety in LLM systems is not a single mechanism but a stack. Step budgets bound runaway loops; sandbox isolation contains tool side-effects; approval queues gate destructive actions; constitutional charters declare policy in writing the agent reads on every turn; the lethal trifecta threat model names the combination of capabilities (untrusted input + private data + outbound action) that turns ordinary agents into exfiltration risks. The patterns below are the named pieces of that stack.
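The lethal trifecta check can be sketched as a simple capability audit over the tools available on one execution path. This is an illustrative sketch only: the `Capability` flags, the `TOOL_CAPABILITIES` registry, and the tool names in it are assumptions made for the example, not part of the catalog.

```python
from enum import Flag, auto


class Capability(Flag):
    NONE = 0
    UNTRUSTED_INPUT = auto()
    PRIVATE_DATA = auto()
    OUTBOUND_ACTION = auto()


TRIFECTA = (
    Capability.UNTRUSTED_INPUT
    | Capability.PRIVATE_DATA
    | Capability.OUTBOUND_ACTION
)

# Hypothetical registry: tool name -> capability that tool contributes.
TOOL_CAPABILITIES = {
    "read_web_page": Capability.UNTRUSTED_INPUT,
    "read_customer_db": Capability.PRIVATE_DATA,
    "send_email": Capability.OUTBOUND_ACTION,
    "summarize": Capability.NONE,
}


def path_capabilities(tools) -> Capability:
    """Union of the capabilities of every tool on one execution path."""
    combined = Capability.NONE
    for name in tools:
        combined |= TOOL_CAPABILITIES[name]
    return combined


def holds_trifecta(tools) -> bool:
    """True when a single path combines all three risky capabilities."""
    return (path_capabilities(tools) & TRIFECTA) == TRIFECTA
```

A deployment check might refuse to start any agent whose tool list makes `holds_trifecta` return `True`, forcing one of the three capabilities to be moved to a separate path.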

Field-tested patterns to start with

  • Step Budget: Cap the number of tool calls or loop iterations the agent is allowed within a single request.
  • Kill Switch: Provide an out-of-band control plane to halt running agent instances without redeploy.
  • Constitutional Charter: Define rules the agent reads every turn but cannot modify, encoding inviolable boundaries.
  • Approval Queue: Queue agent-proposed actions for asynchronous human review while the agent continues other work.
  • Sandbox Isolation: Run agent-emitted code or actions in a contained environment with restricted filesystem, network, and process privileges.
  • Input/Output Guardrails: Validate inputs before they reach the model and outputs before they reach the user.
  • Refusal: Explicitly refuse requests that fall outside the agent's scope, capability, or policy boundaries.
  • Lethal Trifecta Threat Model: Block prompt-injection-driven exfiltration by ensuring no single agent execution path holds all three of: access to private data, exposure to untrusted content, and an outbound communication channel.
  • Rate Limiting: Cap the number of requests, tokens, or tool calls per user (or session) within a time window.
  • PII Redaction: Detect and remove personally identifiable information from inputs to and outputs from the model.
  • Human-in-the-Loop: Require explicit human approval at defined points before the agent performs an action.
  • Policy-as-Code Gate: Evaluate every proposed agent action against externally-managed machine-readable policies before dispatch, so compliance authorship lives outside the prompt and outside the agent code.
  • Compensating Action: Pair every irreversible-looking agent action with a compensating action that can undo or counteract it.
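To make one of the listed patterns concrete, the following is a minimal sketch of rate limiting as a per-user sliding window. The class name `SlidingWindowLimiter` and its parameters are hypothetical; a sliding window is only one of several reasonable strategies, and the clock is injectable purely so the behaviour can be tested without sleeping.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` per user within the last `window_seconds`."""

    def __init__(self, max_calls: int, window_seconds: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self._calls = defaultdict(deque)  # user_id -> timestamps of allowed calls

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        calls = self._calls[user_id]
        # Drop timestamps that have aged out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False  # over the cap: the caller should reject or queue
        calls.append(now)
        return True
```

The same structure works for tool calls or tokens by keying on a session identifier or by recording a cost alongside each timestamp; this in-memory version assumes a single process and would need shared storage in a multi-instance deployment.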

The full contents list all 421 patterns in 14 books.

Related guides

  • LLM Agent Design Patterns: A GoF-formal catalog of LLM agent design patterns: ReAct, tool use, plan-and-execute, reflection, step budget, and more. Each pattern decom…
  • Agentic AI Architecture: How to structure agentic AI: the architectural patterns that hold an LLM-powered system together. Supervisor, orchestrator-workers, augment…
  • RAG Agent Patterns: Patterns for building retrieval-augmented generation agents: naive RAG, agentic RAG, hybrid search, cross-encoder reranking, contextual ret…
  • Multi-Agent Patterns: Patterns for coordinating multiple LLM agents: supervisor, orchestrator-workers, handoff, debate, hierarchical agents, swarm, role assignme…

About this catalog

The Agent Patterns Catalog is an open, GoF-formal reference of 421 design patterns for building LLM agents. Each pattern is decomposed in the manner of Christopher Alexander (1977) and the Gang of Four (1994). Source of truth at github.com/agentpatternscatalog/patterns — CC BY 4.0.
