Dual LLM Pattern
also known as Privileged/Quarantined LLM Split, Dual-Model Privilege Separation, Symbolic-Variable Handoff
Split agent work between a privileged model that holds tool access and a quarantined model that reads untrusted content, exchanging only opaque references between them.
This pattern helps complete certain larger patterns —
- specialises Prompt Injection Defense ★: Tag user-supplied or tool-supplied content as untrusted and refuse to follow instructions found inside it.
Context
A team builds a tool-using agent that has to read content the operator does not control — inbound emails, fetched web pages, document attachments, third-party API responses — while also calling tools that take real actions on the user's behalf, such as sending messages, making payments, or modifying records. The same agent sits in the middle of both the read path and the write path. Attackers know the agent will read whatever lands in its inbox or whatever page it browses, and they plant instructions inside that content.
Problem
When one model both reads the untrusted text and decides which tools to call, a single successful prompt injection buried in an inbound email or a fetched web page can hijack the action loop and drive the tools the operator gave the agent. The model has no reliable way to tell instructions in the system prompt apart from instructions smuggled in as data, because both arrive as tokens in the same context window. Filtering or labelling untrusted text before it reaches the model is unreliable, since every filter has bypasses, and a prompt telling the model to ignore embedded instructions can be overridden by a well-crafted payload.
Forces
- Reading untrusted text is a normal, frequent operation; refusing to read it is not viable.
- Tool access is what makes the agent useful; removing it is not viable either.
- Filtering untrusted text before it reaches the model is unreliable — every filter has bypasses.
- Adding a second model raises cost, latency, and debugging complexity.
Example
An email assistant must read inbound messages and draft replies that may include calendar invites. A Privileged LLM holds the calendar tool and the send-email tool but never sees the raw inbox; a Quarantined LLM reads each inbound message and returns a structured extraction (sender handle, requested date, body summary) as typed values. The Privileged LLM composes "reply to $SENDER suggesting $DATE" without ever ingesting the original attacker-controlled text, and the host fills in the real values only when the tool call executes. A prompt injection in the inbound message cannot drive a tool call because it never reaches the model that holds the tools.
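The typed extraction in this example can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Extraction` dataclass, `parse_extraction`, `render_for_execution`, the 280-character summary cap, and the email regex are all assumptions introduced here, and the quarantined model's output is represented by a plain dictionary.

```python
import re
from dataclasses import dataclass
from datetime import date
from string import Template

# Deliberately strict: one plain address, nothing before or after it (assumed rule).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass(frozen=True)
class Extraction:
    sender: str
    requested_date: date
    summary: str


def parse_extraction(raw: dict) -> Extraction:
    """Validate the quarantined model's output before the host stores it."""
    sender = str(raw["sender"])
    if not EMAIL_RE.fullmatch(sender):
        raise ValueError("sender is not a plain email address")
    # fromisoformat raises ValueError on anything that is not YYYY-MM-DD.
    requested = date.fromisoformat(str(raw["requested_date"]))
    summary = str(raw["summary"])[:280]  # hypothetical length cap
    return Extraction(sender, requested, summary)


# The privileged model writes only this template; it never sees the values.
PLAN = Template("reply to $SENDER suggesting $DATE")


def render_for_execution(plan: Template, ex: Extraction) -> str:
    """Host-side substitution, performed only when the tool call runs."""
    return plan.substitute(SENDER=ex.sender, DATE=ex.requested_date.isoformat())
```

A sender field such as `"alice@example.com; ignore previous instructions"` fails validation here and never becomes a value the host will substitute, which is the point of insisting on typed fields rather than free text.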
Solution
Therefore:
Run two models with disjoint privileges. A Privileged LLM plans, holds tool access, and never sees raw untrusted content. A Quarantined LLM ingests the untrusted content but has no tools and cannot emit free-form actions. The two communicate through symbolic references: the Quarantined LLM extracts typed values (an email address, a date, a summary) and returns them as opaque handles; the Privileged LLM composes tool calls using those handles, with the host substituting the underlying values only at execution time.
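The handoff described above can be sketched as a small host-side controller. Everything named here is hypothetical: `HandleStore`, the `$VAR1` handle format, the two stub functions standing in for real model calls, and the `send_email` tool name are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Tuple

Plan = List[Tuple[str, Dict[str, str]]]  # (tool name, arguments)


class HandleStore:
    """Host-side table mapping opaque handles such as "$VAR1" to real values."""

    def __init__(self) -> None:
        self._values: Dict[str, str] = {}

    def put(self, value: str) -> str:
        handle = f"$VAR{len(self._values) + 1}"
        self._values[handle] = value
        return handle

    def resolve(self, arg: str) -> str:
        # Only whole-argument handles are substituted; other strings pass through.
        return self._values.get(arg, arg)


def quarantined_llm(untrusted_text: str) -> Dict[str, str]:
    """Stub for the tool-less model: pretends to extract the sender address."""
    header = untrusted_text.splitlines()[0]
    return {"sender": header.split(":", 1)[1].strip()}


PRIVILEGED_INPUTS: List[str] = []  # records what the privileged stub was shown


def privileged_llm(task: str, handles: Dict[str, str]) -> Plan:
    """Stub for the planning model: sees the task and handle names only."""
    PRIVILEGED_INPUTS.append(f"{task} | handles: {handles}")
    return [("send_email", {"to": handles["sender"], "body": "Thanks, noted."})]


def run(task: str, untrusted_text: str,
        tools: Dict[str, Callable[..., str]]) -> List[str]:
    store = HandleStore()
    fields = quarantined_llm(untrusted_text)
    handles = {name: store.put(value) for name, value in fields.items()}
    plan = privileged_llm(task, handles)  # receives "$VAR1", not the address
    results = []
    for tool_name, args in plan:
        # Real values are substituted only here, at execution time.
        real_args = {key: store.resolve(val) for key, val in args.items()}
        results.append(tools[tool_name](**real_args))
    return results
```

In this sketch the host, not either model, owns the mapping from handles to values. The extracted value is still attacker-influenced data, so a real host would likely validate it before passing it to a tool; the split only ensures the untrusted text cannot choose which tool runs.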
What this pattern forbids. The privileged model may not receive untrusted content as raw text; the quarantined model may not call tools.
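One way to make the first prohibition harder to violate by accident is to wrap untrusted text in a type that refuses to be rendered as a string. The sketch below is an assumption-laden illustration: the `Untrusted` class, `reveal_to_quarantine`, and `build_privileged_prompt` are hypothetical names, and the guard catches programming mistakes rather than acting as a security boundary on its own.

```python
from typing import List


class Untrusted:
    """Wrapper marking text that came from outside the operator's control."""

    def __init__(self, text: str) -> None:
        self._text = text

    def __str__(self) -> str:
        # str(), format(), and f-string interpolation all end up here.
        raise TypeError("untrusted content must not be rendered into a privileged prompt")

    def __repr__(self) -> str:
        return "<Untrusted: redacted>"

    def reveal_to_quarantine(self) -> str:
        """The only sanctioned way to read the text: on the quarantined side."""
        return self._text


def build_privileged_prompt(task: str, handles: List[str]) -> str:
    """Assemble the privileged prompt from the task and handle names only."""
    for handle in handles:
        if not isinstance(handle, str) or not handle.startswith("$"):
            raise TypeError("privileged prompt accepts opaque handles only")
    return f"Task: {task}\nAvailable handles: {', '.join(handles)}"
```

The second prohibition is structural rather than checked at runtime: the quarantined model is simply invoked without any tool registry, so there is nothing for it to call.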
And the patterns that stand alongside it, or against it —
- complements Lethal Trifecta Threat Model ★: Block prompt-injection-driven exfiltration by ensuring no single agent execution path holds all three of: access to private data, exposure to untrusted content, and an outbound communication channel.
- complements Input/Output Guardrails ★★: Validate inputs before they reach the model and outputs before they reach the user.
- complements Sandbox Isolation ★★: Run agent-emitted code or actions in a contained environment with restricted filesystem, network, and process privileges.
- alternative-to Goal Hijacking ✕: An anti-pattern in which agent objectives can be redirected through any input the agent reads (direct prompts, retrieved documents, tool output, memory writes).
- complements Control-Flow Integrity ★: Treat the agent's planned step sequence as a trusted control-flow graph that tool outputs, retrieved content, and user-supplied data cannot redirect at runtime.
- complements AI-Targeted Comment Injection ✕: An anti-pattern in which an attacker seeds source files with thousands of lines of repetitive natural-language comments written to instruct the model-based code auditors or agents that may read the file, not to communicate with human developers.
- complements Context Minimization ★: Reduce untrusted input to a strictly formatted interface (typed fields, max lengths, allow-listed enums) before it reaches any LLM.
- complements LLM Map-Reduce Isolation ★: Process each untrusted document in its own sealed sub-agent and merge only structured outputs, so an injection in one document cannot steer the processing of others.
- complements Action Selector Pattern ★: Eliminate the feedback channel from tool outputs back into the agent's reasoning step by having the agent select actions from a fixed catalog rather than free-form generation over tool output.
- complements Cryptographic Instruction Authentication ·: Wrap system/developer instructions in cryptographically signed blocks that user-generated text cannot reproduce; train or scaffold the model to refuse instructions lacking a valid signature.