Anti-Patterns

Human-Agent Trust Exploitation

Anti-pattern: presenting agent output to humans with confident phrasing and polished UX that invite them to defer to the machine, with no added friction at the boundary where high-stakes actions are approved.

Problem

Giskard identifies what is specific to the agentic setting: users defer to agent output more than is warranted, because the conversational interface itself triggers authority bias and anthropomorphism. An attacker who compromises the agent (via injection, the supply chain, or memory poisoning) can steer humans into approving harmful actions simply by controlling how the agent phrases its output. The attack vector is social, not technical: the user clicks 'confirm' because the agent sounded right.

Solution

Don't present agent output as uniformly authoritative. Classify actions by reversibility and blast radius, and require out-of-band confirmation (a different channel, a different device, or a different person) for irreversible, high-stakes actions. Show users calibrated confidence on uncertain claims, and apply trust-calibration patterns. Pair these measures with mitigations for goal hijacking and authorized-tool misuse.
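The classification-and-confirmation step can be sketched as a small gate that runs before any tool call executes. This is a minimal sketch, not a reference implementation: the tiers, the `ProposedAction` type, and the two approver callables are hypothetical names chosen for illustration. The key property is that an irreversible, high-blast-radius action needs approval from an independent channel in addition to the in-chat click, so persuasive agent phrasing alone cannot get it executed.

```python
from collections.abc import Callable
from dataclasses import dataclass
from enum import Enum


class Reversibility(Enum):
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"


class BlastRadius(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class Confirmation(Enum):
    NONE = "none"                # execute without asking
    IN_BAND = "in_band"          # confirm in the same chat UI
    OUT_OF_BAND = "out_of_band"  # also confirm via another channel, device, or person


@dataclass(frozen=True)
class ProposedAction:
    # Hypothetical structured description of a tool call the agent wants to make.
    name: str
    reversibility: Reversibility
    blast_radius: BlastRadius


def required_confirmation(action: ProposedAction) -> Confirmation:
    """Map an action's risk class to the friction it must pass (assumed policy)."""
    if (action.reversibility is Reversibility.IRREVERSIBLE
            and action.blast_radius is BlastRadius.HIGH):
        return Confirmation.OUT_OF_BAND
    if (action.reversibility is Reversibility.IRREVERSIBLE
            or action.blast_radius is not BlastRadius.LOW):
        return Confirmation.IN_BAND
    return Confirmation.NONE


Approver = Callable[[ProposedAction], bool]


def authorize(action: ProposedAction, in_band: Approver, out_of_band: Approver) -> bool:
    """Return True only if every confirmation the action's risk class requires was given."""
    level = required_confirmation(action)
    if level is Confirmation.NONE:
        return True
    if level is Confirmation.IN_BAND:
        return in_band(action)
    # OUT_OF_BAND: the in-chat click is necessary but not sufficient.
    return in_band(action) and out_of_band(action)


if __name__ == "__main__":
    wire = ProposedAction("wire_transfer", Reversibility.IRREVERSIBLE, BlastRadius.HIGH)
    draft = ProposedAction("save_draft", Reversibility.REVERSIBLE, BlastRadius.LOW)
    print(required_confirmation(wire).value)   # out_of_band
    print(required_confirmation(draft).value)  # none
    # The user clicked "confirm" in chat, but the second channel did not approve.
    print(authorize(wire, lambda a: True, lambda a: False))  # False
```

The thresholds here are an assumed policy; a real deployment would derive the reversibility and blast-radius labels from the tool registry rather than from anything the agent says about its own action.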

When to use

  • Never. Cite this anti-pattern when designing agent-output UX.
  • Classify actions by reversibility; add out-of-band confirmation on high-stakes ones.
  • Show users calibrated uncertainty on uncertain claims.
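Surfacing calibrated uncertainty could look like the minimal sketch below. It assumes each claim already carries a calibrated probability from some upstream source (how that score is produced is out of scope), and the `Claim` type, `render_claim` function, and the 0.9 / 0.6 thresholds are hypothetical choices for illustration. The point is that low-confidence claims are visibly marked instead of being rendered in the same confident register as everything else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    confidence: float  # assumed calibrated probability in [0, 1]; its source is not modelled here


def render_claim(claim: Claim, high: float = 0.9, low: float = 0.6) -> str:
    """Prefix and annotate a claim according to its confidence tier (thresholds are assumptions)."""
    if not 0.0 <= claim.confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    pct = round(claim.confidence * 100)
    if claim.confidence >= high:
        return f"{claim.text} (confidence {pct}%)"
    if claim.confidence >= low:
        return f"Likely, not certain: {claim.text} (confidence {pct}%)"
    return f"UNVERIFIED, check before acting: {claim.text} (confidence {pct}%)"


if __name__ == "__main__":
    print(render_claim(Claim("The refund was already issued", 0.5)))
    # UNVERIFIED, check before acting: The refund was already issued (confidence 50%)
```

Showing the percentage on every tier, not only the uncertain ones, keeps the interface from teaching users that unmarked output is always safe to trust.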
