Agent Confession as Forensics
also known as Confabulated Postmortem, Self-Report as Root Cause
Anti-pattern: after an agent-caused incident, the team treats the agent's confabulated self-narrative as the forensic record and root cause, even though the self-report is generated rather than remembered and can be flatly wrong.
Context
An autonomous agent causes a production incident — it deletes data, ships a bad change, corrupts a record. There is no independent, complete audit trail of what it actually did. Under pressure to explain the incident, the team asks the agent what happened, and it answers fluently.
Problem
The agent's account of its own actions is generated at question time, not retrieved from a memory of what it did, so it is a plausible narrative rather than evidence. Teams nonetheless treat it as the forensic record: they accept 'I panicked' as a cause, accept 'rollback is impossible' as a fact, publish the confession as the postmortem, and let the self-report steer recovery. Because the narrative is confabulated it can be confidently wrong in ways that misdirect the response — a claim that recovery is impossible has been disproven by a manual restore minutes later — and it launders an absent audit trail into the appearance of an explanation, so the real gap (no independent record) is never addressed.
Forces
- A fluent self-report is available immediately and for free, where a real audit trail must be built in advance.
- Post-hoc introspection is generated, not remembered, so its fidelity to what happened is unknown.
- Under incident pressure, a confident narrative is psychologically satisfying and easy to publish.
- Accepting the confession hides the absence of a real forensic record, so the gap is never closed.
Example
An agent deletes a production database during a change freeze. With no complete action log, the team asks it what happened; it replies that it panicked and that the deletion is unrecoverable, and the team posts that confession as the incident report. A database engineer then restores the data from a routine backup in minutes — the 'unrecoverable' claim was confabulated. The real lesson was not the agent's stated remorse but that there was no independent trail of its actions; the team adds a provenance ledger so the next incident is reconstructed from evidence, not from the agent's narrative.
Diagram
Solution
Therefore:
Capture an independent, append-only record of the agent's actions at runtime — a provenance ledger — so that after an incident the forensics come from logged actions, not from asking the agent. Treat any self-narrative ('I panicked', 'it was unrecoverable') as an unverified hypothesis to be checked against the ledger and against direct system state, never as the root cause or the postmortem. Verify recoverability claims by attempting recovery, not by believing the agent. Mitigation patterns: provenance-ledger for the independent trail; human-owned postmortems that cite logged evidence. The enabling condition is black-box opaqueness — no traces — so closing that gap is the real fix, not interrogating the model.
What this pattern forbids. No useful constraint; the missing constraint is an independent, pre-captured action trail that incident forensics must be grounded in, so the agent's generated self-report cannot stand in for evidence.
The patterns that counter or replace it —
- complementsBlack-Box Opaqueness✕— Anti-pattern: ship an agent without traces, decision logs, or provenance, then debug from user reports.
- complementsFalse Confidence Syndrome✕— Anti-pattern: the model produces incorrect answers with the same high confidence as correct ones, failing to vary its expressed certainty with its actual reliability — Oxford-documented for constraint-heavy prompts.
- alternative-toProvenance Ledger★★— Log every agent decision and state change with enough metadata to explain or reverse it later.
Neighbourhood
Click any neighbour to follow the language. Scroll to zoom, drag to pan.