Safety & Control

Dual LLM Pattern

Split agent work between a privileged model that holds tool access and a quarantined model that reads untrusted content, exchanging only opaque references between them.

Problem

When one model both reads the untrusted text and decides which tools to call, a single successful prompt injection buried in an inbound email or a fetched web page can hijack the action loop and drive the tools the operator gave the agent. The model has no reliable way to tell instructions in the system prompt apart from instructions smuggled in as data, because both arrive as tokens in the same context window. Filtering or labelling untrusted text before it reaches the model is unreliable — every filter has bypasses — and prompting the model to ignore embedded instructions does not survive a clever payload.

Solution

Run two models with disjoint privileges. A Privileged LLM plans, holds tool access, and never sees raw untrusted content. A Quarantined LLM ingests the untrusted content but has no tools and cannot emit free-form actions. The two communicate through symbolic references: the Quarantined LLM extracts typed values (an email address, a date, a summary) and returns them as opaque handles; the Privileged LLM composes tool calls using those handles, with the host substituting the underlying values only at execution time.

When to use

  • Agent processes content from sources the operator does not control (email, web, third-party APIs).
  • Tool calls in the agent can take consequential actions (send, write, pay, publish).
  • Information from untrusted content can be reduced to typed values (addresses, dates, IDs, short strings) rather than free-form text the privileged model must reason over verbatim.

Open the full interactive page

Diagram, neighbourhood map, code examples, related patterns and full provenance.

Related