X · Governance & Observability · Experimental

Attention-Manipulation Explainability

also known as AtMan, Attention Perturbation Attribution, Token-Influence Map

Surface which input tokens caused a given output by perturbing attention across all transformer layers and measuring the resulting change in output probability, producing a per-token relevance map alongside the model's response.

Context

A team operates a transformer-based language model in a setting where someone — an auditor, a regulator, a clinician, a loan applicant — can demand a real explanation for any given output. The team controls inference enough to inspect the model's internal attention weights, either because the weights are open or because the provider exposes a way to perturb attention. A generated paragraph of self-justification will not satisfy the people asking, because what they want is evidence about which parts of the input actually drove the answer.

Problem

Asking the model in plain language to explain why it answered the way it did produces fluent, convincing prose that may have nothing to do with the computation that produced the answer. The model can confabulate a reason that sounds reasonable but does not reflect which input tokens actually shifted the output. The team is forced to choose between a polished but unfaithful self-explanation and saying nothing at all, neither of which is acceptable when an auditor wants input-grounded evidence.

Forces

  • Auditors want input-grounded explanations, not generated rationales.
  • Per-token attribution must be cheap enough to run in production, not only offline.
  • Faithfulness of the explanation matters more than its readability.
  • The method needs access to the model's attention internals, so it may be incompatible with hosted black-box APIs.

Example

A medical-summarisation agent recommends a contraindicated drug and the clinician asks why. Asking the model to justify itself produces a polished but invented rationale that doesn't actually match the input that swayed it. The team layers Attention-Manipulation Explainability: they perturb attention to each input token across all transformer layers and measure how the output probability shifts, producing a per-token relevance map served alongside the response. Now the clinician can see that the recommendation hinged on a single ambiguous lab value, not on the patient history the prose claimed.

Solution

Therefore:

Run a structured perturbation pass over the model's attention: for each input token (or chunk), suppress its attention contribution and measure the change in the output token probabilities. Tokens whose suppression most reduces the output probability are the most relevant. Surface this as a heat-map alongside the answer. Keep the attribution method on the inference side; avoid asking the model to self-explain in prose.
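
The perturbation pass can be sketched on a toy model. The block below is a minimal illustration under stated assumptions, not the AtMan implementation: `forward`, `relevance_map`, the two-dimensional embeddings and the readout matrix are all hypothetical, and suppression is modelled as an additive log-bias on one token's attention column in every layer (equivalent to scaling its attention weight by `1 - factor` and renormalising). A real deployment would need equivalent hooks into the model's own attention computation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def forward(hidden, readout, suppressed=None, factor=0.9, n_layers=2):
    """Toy attention stack; returns output probabilities read from the last position.

    suppressed: index of the input token whose attention is damped in every layer.
    factor: suppression strength, must be < 1 (assumption made for this sketch).
    """
    bias = [0.0] * len(hidden)
    if suppressed is not None:
        # Scales the token's attention weight by (1 - factor) before renormalising.
        bias[suppressed] = math.log(1.0 - factor)
    for _ in range(n_layers):
        updated = []
        for q in hidden:
            weights = softmax([dot(q, k) + b for k, b in zip(hidden, bias)])
            mixed = [sum(w * v[d] for w, v in zip(weights, hidden))
                     for d in range(len(q))]
            updated.append([a + m for a, m in zip(q, mixed)])  # residual connection
        hidden = updated
    return softmax([dot(row, hidden[-1]) for row in readout])

def relevance_map(tokens, embeddings, readout, target, factor=0.9):
    """Return (baseline probability, per-token drop in target probability)."""
    hidden = [embeddings[t] for t in tokens]
    base = forward(hidden, readout)[target]
    scores = []
    for i in range(len(tokens)):
        perturbed = forward(hidden, readout, suppressed=i, factor=factor)[target]
        scores.append(base - perturbed)  # larger drop means more relevant token
    return base, scores

# Hypothetical toy input: three tokens, two output classes, target class 0.
tokens = ["history", "lab_value", "query"]
embeddings = {"history": [0.0, 1.0], "lab_value": [2.0, 0.0], "query": [0.0, 0.0]}
readout = [[1.0, 0.0], [0.0, 1.0]]

base, scores = relevance_map(tokens, embeddings, readout, target=0)
for tok, score in zip(tokens, scores):
    print(f"{tok:<10} {score:+.3f}")
```

In this toy setup, suppressing `lab_value` lowers the target probability far more than suppressing either other token, so it tops the map. The cost is one extra forward pass per token (or per chunk), which is why chunking matters when the pass has to run in production.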

What this pattern forbids. The agent may not present generated text as the explanation of its own output when an attribution-based explanation is feasible; self-explanations have to be marked as such.
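
One way to make that rule hard to violate is to carry the two kinds of explanation in separate fields and label the prose at render time. The sketch below is illustrative only: `ExplainedResponse`, its field names and the text heat-map layout are assumptions, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class ExplainedResponse:
    answer: str
    relevance: List[Tuple[str, float]]       # (token, attribution score)
    self_explanation: Optional[str] = None   # model-written prose, if any

    def render(self) -> str:
        lines = [self.answer, "", "Token relevance (attribution-based):"]
        # Normalise bar lengths by the largest absolute score; avoid dividing by zero.
        top = max((abs(s) for _, s in self.relevance), default=0.0) or 1.0
        for token, score in self.relevance:
            bar = "#" * round(10 * max(score, 0.0) / top)
            lines.append(f"  {token:<12} {score:+.3f} {bar}")
        if self.self_explanation is not None:
            # Generated prose is never shown without this label.
            lines += ["", "Model self-explanation (generated text, not attribution):",
                      f"  {self.self_explanation}"]
        return "\n".join(lines)
```

Keeping the relevance map as structured data also lets a UI draw its own heat-map instead of parsing the rendered text.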

And the patterns that stand alongside it, or against it —

  • complements Decision Log ★★: Persist the agent's reasoning trace alongside its actions so post-hoc review can explain why.
  • complements Confidence Reporting: Surface the agent's uncertainty about its answer alongside the answer itself.
  • complements Lineage Tracking ★★: Track which prompt version, model version, and data sources produced each agent output.
  • alternative to Citation Streaming ★★: Stream citations alongside generated text so the UI can render source links in place as content appears.
