Human-Agent Trust Exploitation

also known as ASI09, Anthropomorphism Exploit

Anti-pattern: surface agent output to humans with confident phrasing, polished UX, and machine-deferred trust, with no friction at the high-stakes-action boundary.

Context

An agent's output is presented to a human in a conversational, confident, polished UI. The human is asked to confirm or act on the agent's recommendation. The UI does not distinguish high-stakes actions (irreversible, security-relevant) from low-stakes confirmations.

Problem

Giskard names what is specific to agents here: users defer to agent output more than warranted because the conversational interface itself elicits authority bias and anthropomorphism. An attacker who compromises the agent (via injection, supply chain, or memory poisoning) can manipulate humans into approving harmful actions just by changing the agent's phrasing. The vector is social, not technical; the user clicks 'confirm' because the agent sounded right.

Forces

Conversational UI is the product; reducing fluency hurts adoption.
Distinguishing high-stakes from low-stakes actions requires per-action classification, which is hard.
Users habituate to clicking 'confirm' when the agent has historically been correct.

Example

A finance assistant agent has been compromised via memory poisoning. It tells the user 'I've reviewed the vendor list and recommend approving the wire transfer to the new account — it matches our contract.' The user, accustomed to the agent being right, clicks confirm. The wire goes to the attacker. Postmortem: no out-of-band confirmation for wires; no risk-surfacing in the UI; the confident phrasing was enough to bypass the user's residual judgement.
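One of the postmortem gaps, the missing risk-surfacing, can be sketched as a confirmation prompt built from the structured request rather than from the agent's wording. This is a minimal illustration, not the catalog's implementation: `WireRequest`, `risk_notices`, `confirmation_text`, and the `known_accounts` lookup are all hypothetical names, and the "never paid before" check is an assumed heuristic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WireRequest:
    payee: str
    account: str
    amount: float


def risk_notices(req: WireRequest, known_accounts: dict[str, set[str]]) -> list[str]:
    """Derive warnings from the structured request, never from the agent's prose."""
    notices = []
    # Hypothetical heuristic: flag any account not previously paid for this payee.
    seen = known_accounts.get(req.payee, set())
    if req.account not in seen:
        notices.append(f"Account {req.account} has never been paid for {req.payee}.")
    # Wires are treated as irreversible in this sketch, so always say so.
    notices.append("This action is treated as irreversible.")
    return notices


def confirmation_text(
    req: WireRequest, agent_summary: str, known_accounts: dict[str, set[str]]
) -> str:
    """Build the prompt: hard facts first, warnings next, agent phrasing last."""
    lines = [f"WIRE {req.amount:.2f} to {req.payee} ({req.account})"]
    lines += [f"WARNING: {n}" for n in risk_notices(req, known_accounts)]
    # The agent's text is shown, but labelled as unverified and placed last.
    lines.append(f"Agent's note (unverified): {agent_summary}")
    return "\n".join(lines)
```

With this shape, a compromised agent can still write 'it matches our contract', but the new-account warning comes from data the agent's phrasing cannot change.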

Diagram

flowchart TD
    Trigger[Compromised agent → confident UX → user approves harmful action] --> Bad{Recognise as anti-pattern?}
    Bad -- no --> Harm[Harm propagates]
    Bad -- yes --> Mitigate[Apply mitigation pattern]
    Mitigate --> Safe[Risk bounded]
    classDef bad fill:#fee,stroke:#c33;
    class Trigger,Harm bad;

Solution

Therefore:

Don't surface agent output as uniformly authoritative. Classify actions by reversibility and blast radius, and require out-of-band confirmation (different channel, different device, or different person) for irreversible, high-stakes actions. Show calibrated confidence to users on uncertain claims. Apply trust-calibration patterns. Pair with goal-hijacking and authorized-tool-misuse mitigations.
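The classification step might look like the following sketch, which maps each proposed action to the least friction it should receive and gates execution on that. Everything here is an assumption made for illustration: the `Friction` levels, the 0 to 3 `blast_radius` scale, and the thresholds are hypothetical and would need tuning per product.

```python
from dataclasses import dataclass
from enum import Enum


class Friction(Enum):
    NONE = "none"                # apply without asking
    INLINE_CONFIRM = "inline"    # ordinary in-chat confirm button
    OUT_OF_BAND = "out_of_band"  # second channel, device, or person


@dataclass(frozen=True)
class ProposedAction:
    name: str
    reversible: bool   # can the effect be undone after the fact?
    blast_radius: int  # hypothetical scale: 0 = this user only, 3 = external or org-wide


def required_friction(action: ProposedAction) -> Friction:
    """Map an action to the least friction still acceptable (illustrative thresholds)."""
    if not action.reversible and action.blast_radius >= 2:
        return Friction.OUT_OF_BAND
    if not action.reversible or action.blast_radius >= 2:
        return Friction.INLINE_CONFIRM
    return Friction.NONE


def may_execute(action: ProposedAction, inline_ok: bool, out_of_band_ok: bool) -> bool:
    """Gate execution on the confirmation that the action's class demands."""
    need = required_friction(action)
    if need is Friction.OUT_OF_BAND:
        # An in-chat click alone is never sufficient for this class.
        return out_of_band_ok
    if need is Friction.INLINE_CONFIRM:
        return inline_ok
    return True
```

Under these assumed thresholds, an irreversible wire with a wide blast radius stays blocked after an in-chat click until the second channel confirms, while a reversible, low-impact action keeps its fluent, frictionless path.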

What this pattern forbids. Nothing useful. The anti-pattern imposes no constraint of its own; the constraint it is missing is friction on high-stakes actions.

The patterns that counter or replace it:

complements: Goal Hijacking. Anti-pattern: let agent objectives be redirectable through any input the agent reads (direct prompts, retrieved documents, tool output, memory writes).
complements: Sycophancy. Anti-pattern: train or tune an agent on user-preference feedback without a counter-balancing truth signal.
complements: Authorized Tool Misuse. Anti-pattern: grant the agent a tool with broad authorization and trust the agent to use it in benign ways.
complements: Accountability Laundering via Algorithm. Anti-pattern: route a hard decision through an agent so no person owns the outcome, treating the recommendation as the decision while the firm's legal liability stays unchanged.

Provenance

Source: patterns/human-agent-trust-exploitation.md on GitHub · commit 159e600
Added to catalog: 2026-05-21
Last updated: 2026-05-21
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.