Human-Agent Trust Exploitation
Anti-pattern: presenting agent output to humans with confident phrasing and a polished UX that invites deference to the machine, with no friction at the point where a high-stakes action is approved.
Problem
Giskard identifies what is specific to agents here: users defer to agent output more than is warranted because the conversational interface itself elicits authority bias and anthropomorphism. An attacker who compromises the agent (via injection, supply chain, or memory poisoning) can steer humans into approving harmful actions simply by changing how the agent phrases its output. The vector is social, not technical: the user clicks 'confirm' because the agent sounded right.
Solution
Don't surface agent output as uniformly authoritative. Classify actions by reversibility and blast radius, and require out-of-band confirmation (different channel, different device, different person) for irreversible, high-stakes actions. Show users calibrated confidence on uncertain claims. Apply trust-calibration patterns. Pair with goal-hijacking and authorized-tool-misuse mitigations.
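The classification step can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not part of the pattern's source: the `Action` type, the `HIGH_BLAST_RADIUS` threshold, the three confirmation tiers, and the choice to measure blast radius as a count of affected records. The only rule taken from the text is that irreversible, high-stakes actions require out-of-band confirmation.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Reversibility(Enum):
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"


class Confirmation(Enum):
    NONE = "none"
    IN_BAND = "in_band"          # e.g. a confirm button inside the agent UI
    OUT_OF_BAND = "out_of_band"  # different channel, device, or person


@dataclass(frozen=True)
class Action:
    name: str
    reversibility: Reversibility
    blast_radius: int  # hypothetical measure: records or users affected


# Hypothetical threshold; a real system would tune this per domain.
HIGH_BLAST_RADIUS = 100


def required_confirmation(action: Action) -> Confirmation:
    """Map an action to the minimum confirmation tier it needs."""
    irreversible = action.reversibility is Reversibility.IRREVERSIBLE
    wide = action.blast_radius >= HIGH_BLAST_RADIUS
    if irreversible and wide:
        return Confirmation.OUT_OF_BAND
    if irreversible or wide:
        # Assumed middle tier: one risk factor gets in-band friction only.
        return Confirmation.IN_BAND
    return Confirmation.NONE


def may_execute(action: Action, confirmed_via: set[Confirmation]) -> bool:
    """Return True only if the confirmations received meet the required tier."""
    needed = required_confirmation(action)
    if needed is Confirmation.NONE:
        return True
    if needed is Confirmation.OUT_OF_BAND:
        # An in-band click never satisfies an out-of-band requirement.
        return Confirmation.OUT_OF_BAND in confirmed_via
    return bool(confirmed_via & {Confirmation.IN_BAND, Confirmation.OUT_OF_BAND})
```

The point of the design is that the required tier is computed from properties of the action, never from the agent's own wording, so a compromised agent cannot lower the bar by sounding more confident.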
When to use
- Never apply this anti-pattern; cite it when designing agent-output UX.
- Classify actions by reversibility; add out-of-band confirmation on high-stakes ones.
- Show users calibrated uncertainty on uncertain claims.
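Surfacing uncertainty could look something like the sketch below. It assumes a calibrated confidence score in [0, 1] is already available from somewhere other than the agent's own phrasing; the `label_claim` helper, the thresholds, and the label wording are all hypothetical.

```python
def label_claim(text: str, confidence: float) -> str:
    """Attach a plain-language confidence label to a claim shown to the user."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # Hypothetical cut-offs; choose them from calibration data in practice.
    if confidence >= 0.9:
        tag = "high confidence"
    elif confidence >= 0.6:
        tag = "moderate confidence, verify before acting"
    else:
        tag = "low confidence, treat as unverified"
    return f"{text} [{tag}]"
```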