Governance & Observability

Attention-Manipulation Explainability

Surface which input tokens caused a given output by perturbing attention across all transformer layers and measuring the resulting change in output probability, producing a per-token relevance map alongside the model's response.

Problem

Asking the model in plain language to explain why it answered the way it did produces fluent, convincing prose that may have nothing to do with the computation that produced the answer. The model can confabulate a plausible-sounding rationale that does not reflect which input tokens actually shifted the output. The team is then forced to choose between a polished but unfaithful self-explanation and no explanation at all, and neither is acceptable when an auditor wants input-grounded evidence.

Solution

Run a structured perturbation pass over the model's attention: for each input token (or chunk), suppress the attention paid to it at every transformer layer, rerun the forward pass, and measure the change in the probability of the output tokens the model originally produced. Tokens whose suppression most reduces that probability are the most relevant; a token whose suppression leaves it unchanged, or raises it, did not support the output. Surface the scores as a heat-map alongside the answer. Keep the attribution method on the inference side; avoid asking the model to self-explain in prose.
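
The perturbation pass can be sketched in a few lines. The example below is an illustrative toy under stated assumptions, not a production implementation: `toy_forward` is a hypothetical single-layer, single-head attention model with hand-picked weights, and the names `token_relevance`, `render_heatmap`, `EMBEDDINGS`, `QUERY` and `READOUT` are invented for the example. In a real transformer the same suppression would have to be applied to the attention scores of every layer and head, which requires inference-side access to those scores.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def toy_forward(embeddings, query, readout, suppress=None, penalty=1e9):
    """Hypothetical one-layer attention model; returns output probabilities.

    If `suppress` is a token index, that token's pre-softmax attention score
    is lowered by `penalty`, so it receives effectively zero attention.
    """
    scores = [dot(query, e) for e in embeddings]
    if suppress is not None:
        scores[suppress] -= penalty
    weights = softmax(scores)
    dim = len(embeddings[0])
    context = [sum(w * e[d] for w, e in zip(weights, embeddings))
               for d in range(dim)]
    logits = [dot(row, context) for row in readout]
    return softmax(logits)


def token_relevance(embeddings, query, readout, target):
    """Relevance of token i = drop in log-probability of `target`
    when attention to token i is suppressed (positive = supportive)."""
    base = math.log(toy_forward(embeddings, query, readout)[target])
    relevance = []
    for i in range(len(embeddings)):
        p = toy_forward(embeddings, query, readout, suppress=i)[target]
        relevance.append(base - math.log(p))
    return relevance


def render_heatmap(tokens, scores, width=10):
    """Plain-text heat-map: one line per token, bar length scaled to the
    largest positive score; non-positive scores get an empty bar."""
    peak = max(max(scores), 1e-12)
    lines = []
    for tok, s in zip(tokens, scores):
        bar = "#" * round(width * max(s, 0.0) / peak)
        lines.append(f"{tok:>10} {s:+.3f} {bar}")
    return "\n".join(lines)


# Hand-picked toy data (assumption): the second token carries the signal
# that drives output class 0.
TOKENS = ["the", "refund", "policy"]
EMBEDDINGS = [[1.0, 0.0], [0.0, 2.0], [1.0, 0.0]]
QUERY = [0.0, 1.0]
READOUT = [[0.0, 1.0], [1.0, 0.0]]

if __name__ == "__main__":
    scores = token_relevance(EMBEDDINGS, QUERY, READOUT, target=0)
    print(render_heatmap(TOKENS, scores))
```

In this toy, suppressing "refund" lowers the probability of the target class, so it gets the highest score, while the other two tokens score slightly negative because removing them shifts more attention onto "refund". The log-probability difference is one reasonable choice of metric; a raw probability difference would rank the tokens the same way here.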

When to use

  • You need a faithful per-token relevance map of which inputs caused a given output.
  • You control inference (open weights or a provider that exposes attention perturbation).
  • Free-text self-explanations are insufficient because the model confabulates its reasons.
