Input/Output Guardrails
also known as Guards, Validators, Content Filters
Validate inputs before they reach the model and outputs before they reach the user.
This pattern helps complete certain larger patterns —
- used-by Agent Middleware Chain ★ — Wrap every model call, tool call, and memory access in a composable pre/execute/post interceptor pipeline so cross-cutting concerns attach without touching agent or orchestrator code.
Context
A team runs a production agent exposed to real users on the input side and to real downstream consumers on the output side. The input side receives adversarial content — prompt-injection payloads, attempts to coax the model into leaking secrets or personally identifying information, requests to violate policy. The output side risks shipping payloads that fail schema, contain toxic content, echo a credit card number, or otherwise breach what the operator promised customers and regulators.
Problem
Asking the model itself to police what flows in and out fails by construction: the model is the very surface being defended, and the same generation that might leak a secret is also the one being asked to refuse to leak it. A clever attacker only needs to find one phrasing that flips the model's behaviour. Without a layer outside the model that runs deterministic checks on both the input and the output path, the team is left trusting the model to be its own gatekeeper, which it cannot reliably do under adversarial pressure.
Forces
- Guards add latency and cost.
- Over-strict guards block legitimate traffic.
- Adversarial inputs evolve; guards must too.
Example
A consumer-facing chatbot built on a frontier model gets jailbroken on launch day with a classic 'ignore previous instructions' payload pasted into the user message, and a separate user discovers it will happily echo a stored credit-card number on request. The team adds input/output guardrails: the input pipeline runs regex checks plus a small classifier and rejects known injection shapes; the output pipeline runs schema validation, a toxicity classifier, and a card/SSN redactor. Both classes of incident drop to near-zero within a week.
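The regex half of that example can be sketched in a few lines. This is a minimal illustration, not the team's actual rules: the pattern list, the function names (`looks_like_injection`, `redact`) and the replacement tokens are all assumptions made here, and the classifier stages are omitted entirely. The card redactor uses a Luhn checksum to avoid redacting arbitrary long digit runs.

```python
import re

# Hypothetical injection shapes, for illustration only. A real deployment
# would keep a larger, regularly updated set and pair it with a classifier.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(the\s+)?(previous|prior|above)\s+instructions", re.IGNORECASE),
    re.compile(r"disregard\s+(your|the)\s+(system\s+)?prompt", re.IGNORECASE),
]

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
# US Social Security number in the common 3-2-4 dashed form.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def looks_like_injection(user_message: str) -> bool:
    """Input guard: True if the message matches a known injection shape."""
    return any(p.search(user_message) for p in INJECTION_PATTERNS)


def luhn_valid(digits: str) -> bool:
    """Luhn checksum over a string of decimal digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> str:
    """Output guard: mask Luhn-valid card numbers and dashed SSNs."""
    def _card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED CARD]" if luhn_valid(digits) else match.group()

    text = CARD_RE.sub(_card, text)
    return SSN_RE.sub("[REDACTED SSN]", text)
```

Regex checks like these only catch known shapes, which is why the example pairs them with classifiers and why the guards have to evolve with the attacks.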
Solution
Therefore:
Place validators on the input path (regex, classifier, allowlist) and on the output path (schema, toxicity classifier, secret redaction). Compose validators per use case. On failure, either raise an exception or return a fallback response. A shared hub of pre-built validators can be reused across products.
What this pattern forbids. Inputs not passing input guards never reach the model; outputs not passing output guards never reach the user.
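One way to picture the composition is a thin wrapper around the model call. The sketch below is an assumption-laden illustration rather than a reference implementation: `GuardedModel`, `GuardrailViolation`, and the three validator factories are hypothetical names invented here, `model` stands for any callable that maps a prompt string to a response string, and the schema check is reduced to "JSON object with required keys".

```python
import json
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


class GuardrailViolation(Exception):
    """Raised when a validator rejects a payload."""

    def __init__(self, stage: str, reason: str):
        super().__init__(f"{stage} guard failed: {reason}")
        self.stage = stage
        self.reason = reason


# A validator returns None on success, or a reason string on failure.
Validator = Callable[[str], Optional[str]]


def max_length(limit: int) -> Validator:
    def check(text: str) -> Optional[str]:
        return None if len(text) <= limit else f"longer than {limit} characters"
    return check


def deny_pattern(pattern: str, reason: str) -> Validator:
    compiled = re.compile(pattern, re.IGNORECASE)

    def check(text: str) -> Optional[str]:
        return reason if compiled.search(text) else None
    return check


def json_with_keys(*keys: str) -> Validator:
    """Simplified stand-in for schema validation of the model output."""
    def check(text: str) -> Optional[str]:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            return "output is not valid JSON"
        if not isinstance(data, dict):
            return "output is not a JSON object"
        missing = [k for k in keys if k not in data]
        return f"missing keys: {missing}" if missing else None
    return check


@dataclass
class GuardedModel:
    model: Callable[[str], str]
    input_guards: List[Validator]
    output_guards: List[Validator]
    fallback: Optional[str] = None  # if None, violations propagate as exceptions

    def _run(self, stage: str, guards: List[Validator], text: str) -> None:
        for guard in guards:
            reason = guard(text)
            if reason is not None:
                raise GuardrailViolation(stage, reason)

    def __call__(self, user_input: str) -> str:
        try:
            # Input guards run first: a rejected input never reaches the model.
            self._run("input", self.input_guards, user_input)
            output = self.model(user_input)
            # Output guards run last: a rejected output never reaches the user.
            self._run("output", self.output_guards, output)
            return output
        except GuardrailViolation:
            if self.fallback is None:
                raise
            return self.fallback
```

Because each validator is just a function, a product can pick its own list from a shared library of factories, and the choice between exception and fallback is a single constructor argument.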
The smaller patterns that complete this one —
- generalises PII Redaction ★★ — Detect and remove personally identifiable information from inputs to and outputs from the model.
- uses Structured Output ★★ — Constrain the model's output to conform to a JSON Schema (or similar typed shape).
- generalises Multimodal Guardrails ★ — Input and output guardrails that operate across modalities (vision, audio, file) rather than text only — handling e.g. malicious instructions embedded in image OCR or audio transcription.
And the patterns that stand alongside it, or against it —
- complements Code-Switching-Aware Agent ★ — Treat mixed-language input (e.g. Hinglish in Roman script) as the expected shape, and design tokenisation, language tagging, and tool routing to handle it natively without forcing the user to commit to one language.
- complements Computer Use ★ — Let the model drive a desktop end-to-end via screenshots plus virtual mouse/keyboard tool calls instead of bespoke per-app APIs.
- complements Dual LLM Pattern ★ — Split agent work between a privileged model that holds tool access and a quarantined model that reads untrusted content, exchanging only opaque references between them.
- complements Lethal Trifecta Threat Model ★ — Block prompt-injection-driven exfiltration by ensuring no single agent execution path holds all three of: access to private data, exposure to untrusted content, and an outbound communication channel.
- composes-with Prompt Injection Defense ★ — Tag user-supplied or tool-supplied content as untrusted and refuse to follow instructions found inside it.
- complements Refusal ★★ — Explicitly refuse requests that fall outside the agent's scope, capability, or policy boundaries.
- composes-with Sandbox Isolation ★★ — Run agent-emitted code or actions in a contained environment with restricted filesystem, network, and process privileges.
- composes-with Secrets Handling ★ — Ensure the model never receives secrets in plaintext; tools resolve credentials from references at runtime.
- complements Session Isolation ★★ — Keep one user's session state and memory unreachable from another user's agent.
- composes-with Tool Output Poisoning Defense ★ — Treat tool output as untrusted content and apply instruction-stripping plus per-tool trust labels.
- alternative-to Tool Output Trusted Verbatim ✕ — Anti-pattern: trust whatever tools return without validation, schema enforcement, or trust labels.
- complements Proactive Goal Creator ★ — Anticipate the user's goal by capturing surrounding multimodal context (gestures, screen state, environment) in addition to what the user types or says.
- complements Policy-as-Code Gate ★ — Evaluate every proposed agent action against externally-managed machine-readable policies before dispatch, so compliance authorship lives outside the prompt and outside the agent code.
- complements Typed Refusal Codes ★ — Define a single source of truth for machine-readable refusal codes across all guard surfaces, so refusals can be triaged mechanically rather than by string-grepping ad-hoc human-readable messages.
- complements Authorized Tool Misuse ✕ — Anti-pattern: grant the agent a tool with broad authorization and trust the agent to use it in benign ways.
- complements Context Minimization ★ — Reduce untrusted input to a strictly formatted interface (typed fields, max lengths, allow-listed enums) before it reaches any LLM.
- complements Supervisor-Plus-Gate ★ — Supervisor controller that validates and gates LLM outputs against deterministic checks before they commit to side-effects.