Safety & Control

Prompt Injection Defense

Tag user-supplied or tool-supplied content as untrusted and refuse to follow instructions found inside it.

Problem

Large language models cannot reliably distinguish the operator's instructions from instructions embedded in retrieved or user-supplied content, because both arrive as tokens in the same context window. Any document, web page, or tool response that reaches the model is potentially an attacker-authored prompt the model may obey, and the model has no built-in notion of which parts of its context have authority over it. Without a layer that explicitly marks untrusted content, and a model trained or prompted to treat anything inside those markers as read-only data, the agent will sooner or later follow instructions it should be ignoring.

Solution

Establish an instruction hierarchy in which system prompts are trusted, user prompts are partially trusted, and tool or document content is untrusted. Wrap untrusted content in explicit markers, and train or prompt the model to refuse instructions that appear inside them. Add output guardrails that detect known exfiltration patterns.

When to use

Untrusted content (user input, retrieved documents, tool output) reaches the model.
A clear instruction hierarchy can be encoded with markers around untrusted content.
Output guardrails can detect known exfiltration patterns.
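An output guardrail of the kind mentioned above might look like the following sketch. The specific checks are assumptions chosen for illustration: a hypothetical host allowlist (`ALLOWED_HOSTS`) for URLs in the model's output, and a rough regular expression for strings that resemble API keys. Real deployments would need patterns tuned to their own secrets and exfiltration channels.

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist: hosts the agent is permitted to link to.
ALLOWED_HOSTS = {"docs.example.com"}

# Matches http(s) URLs up to whitespace or a closing delimiter.
URL_RE = re.compile(r"https?://[^\s)\"'>]+")

# Illustrative pattern for key-like strings such as "sk-" plus 16+ characters.
SECRET_RE = re.compile(r"\b(?:sk|api|key)[-_][A-Za-z0-9]{16,}\b")


def find_violations(output: str) -> list[str]:
    """Return a description of each suspected exfiltration pattern."""
    violations = []
    for url in URL_RE.findall(output):
        host = urlparse(url).hostname or ""
        if host not in ALLOWED_HOSTS:
            violations.append(f"url to non-allowlisted host: {host}")
    if SECRET_RE.search(output):
        violations.append("possible secret in output")
    return violations


def guard(output: str) -> str:
    """Withhold the model output if any check fires, else pass it through."""
    if find_violations(output):
        return "[response withheld by output guardrail]"
    return output
```

The URL check targets a common exfiltration route, in which injected instructions ask the model to emit a link or image whose query string carries private data to an attacker's server. A guardrail like this only catches patterns it was written for, so it complements the markers rather than replacing them.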
