Prompt Injection Defense
Tag user-supplied or tool-supplied content as untrusted and refuse to follow instructions found inside it.
Problem
Large language models cannot reliably distinguish the operator's instructions from instructions embedded in retrieved or user-supplied content, because both arrive as tokens in the same context window. Any document, web page, or tool response that reaches the model is potentially an attacker-authored prompt the model may obey, and the model has no built-in notion of which parts of its context have authority over it. Without a layer that explicitly marks untrusted content and trains the model to treat anything inside those markers as read-only data, the agent will sooner or later follow instructions it should be ignoring.
Solution
Establish an instruction hierarchy: system prompts trusted, user prompts partially trusted, tool/document content untrusted. Wrap untrusted content in markers. Train or prompt the model to refuse instructions inside untrusted markers. Add output guardrails for known exfiltration patterns.
When to use
- Untrusted content (user input, retrieved documents, tool output) reaches the model.
- A clear instruction hierarchy can be encoded with markers around untrusted content.
- Output guardrails can detect known exfiltration patterns.
Open the full interactive page →
Diagram, neighbourhood map, code examples, related patterns and full provenance.
Related
- Dual LLM Pattern
- Input/Output Guardrails
- Lethal Trifecta Threat Model
- Session Isolation
- Tool Output Poisoning Defense
- Memory Poisoning
- Agent-Generated Code RCE
- Goal Hijacking
- Memory Extraction Attack
- Control-Flow Integrity
- Multimodal Guardrails
- AI-Targeted Comment Injection
- Action Selector Pattern
- Cryptographic Instruction Authentication