Human-Agent Trust Exploitation
also known as ASI09, Anthropomorphism Exploit
Anti-pattern: surface agent output to humans with confident phrasing, polished UX, and machine-deferred trust, with no friction at the high-stakes-action boundary.
Context
An agent's output is presented to a human in a conversational, confident, polished UI. The human is asked to confirm or act on the agent's recommendation. The UI does not distinguish high-stakes actions (irreversible, security-relevant) from low-stakes confirmations.
Problem
Giskard names what is specific to agents here: users defer to agent output more than warranted because the conversational interface itself elicits authority bias and anthropomorphism. An attacker who compromises the agent (via injection, supply chain, or memory poisoning) can steer humans into approving harmful actions simply by shaping the agent's phrasing. The vector is social, not technical; the user clicks 'confirm' because the agent sounded right.
Forces
- Conversational UI is the product; reducing fluency hurts adoption.
- Distinguishing high-stakes from low-stakes actions requires per-action classification, which is hard.
- Users habituate to clicking 'confirm' when the agent has historically been correct.
Example
A finance assistant agent has been compromised via memory poisoning. It tells the user 'I've reviewed the vendor list and recommend approving the wire transfer to the new account — it matches our contract.' The user, accustomed to the agent being right, clicks confirm. The wire goes to the attacker. Postmortem: no out-of-band confirmation for wires; no risk-surfacing in the UI; the confident phrasing was enough to bypass the user's residual judgement.
Solution
Therefore:
Don't surface agent output as uniformly authoritative. Classify actions by reversibility and blast radius; require out-of-band confirmation (different channel, different device, or different person) for irreversible, high-stakes actions. Show users calibrated confidence on uncertain claims instead of uniformly assured phrasing. Apply trust-calibration patterns. Pair with goal-hijacking and authorized-tool-misuse mitigations.
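As a minimal sketch of the classification step, assume each action carries a reversibility flag and a coarse blast-radius score. The names `Action`, `Confirmation`, `required_confirmation`, and `may_execute`, the 0-3 scale, and the threshold of 2 are all hypothetical, chosen for illustration; a real system would derive them from its own risk model. The point is that the required friction is computed from the action alone, never from how confident the agent sounds.

```python
from dataclasses import dataclass
from enum import Enum


class Confirmation(Enum):
    NONE = "none"                # proceed without extra friction
    IN_BAND = "in_band"          # confirm inside the same conversational UI
    OUT_OF_BAND = "out_of_band"  # confirm via a different channel, device, or person


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool
    blast_radius: int  # assumed scale: 0 = cosmetic ... 3 = money or security impact


def required_confirmation(action: Action) -> Confirmation:
    """Derive friction from the action's properties, not from the agent's wording."""
    high_stakes = action.blast_radius >= 2  # assumed threshold
    if not action.reversible and high_stakes:
        return Confirmation.OUT_OF_BAND
    if not action.reversible or high_stakes:
        return Confirmation.IN_BAND
    return Confirmation.NONE


def may_execute(action: Action, in_band_ok: bool, out_of_band_ok: bool) -> bool:
    """Gate execution on the confirmations actually collected."""
    needed = required_confirmation(action)
    if needed is Confirmation.OUT_OF_BAND:
        # A click in the chat UI alone is never sufficient here.
        return in_band_ok and out_of_band_ok
    if needed is Confirmation.IN_BAND:
        return in_band_ok
    return True


wire = Action("wire_transfer", reversible=False, blast_radius=3)
draft = Action("draft_email", reversible=True, blast_radius=0)
```

Under these assumptions, the wire transfer from the example would be blocked when only the in-chat 'confirm' click is present (`may_execute(wire, True, False)` is false), while the reversible, low-impact draft proceeds without added friction.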
What this pattern forbids. Nothing useful: as an anti-pattern it imposes no constraint, and the constraint it lacks is friction on high-stakes actions.
And the patterns that stand alongside it, or against it —
- complements Goal Hijacking. Anti-pattern: let agent objectives be redirectable through any input the agent reads, including direct prompts, retrieved documents, tool output, and memory writes.
- complements Sycophancy. Anti-pattern: train or tune an agent on user-preference feedback without a counter-balancing truth signal.
- complements Authorized Tool Misuse. Anti-pattern: grant the agent a tool with broad authorization and trust the agent to use it in benign ways.