Safety & Control

Prompt Injection Defense

Tag user-supplied or tool-supplied content as untrusted and refuse to follow instructions found inside it.

Problem

Large language models cannot reliably distinguish the operator's instructions from instructions embedded in retrieved or user-supplied content, because both arrive as tokens in the same context window. Any document, web page, or tool response that reaches the model is potentially an attacker-authored prompt the model may obey, and the model has no built-in notion of which parts of its context have authority over it. Without a layer that explicitly marks untrusted content and trains the model to treat anything inside those markers as read-only data, the agent will sooner or later follow instructions it should be ignoring.

Solution

Establish an instruction hierarchy: system prompts trusted, user prompts partially trusted, tool/document content untrusted. Wrap untrusted content in markers. Train or prompt the model to refuse instructions inside untrusted markers. Add output guardrails for known exfiltration patterns.

When to use

  • Untrusted content (user input, retrieved documents, tool output) reaches the model.
  • A clear instruction hierarchy can be encoded with markers around untrusted content.
  • Output guardrails can detect known exfiltration patterns.

Open the full interactive page

Diagram, neighbourhood map, code examples, related patterns and full provenance.

Related