VIII · Safety & Control · Experimental

Reversibility-Aware Action Filter

also known as Reversibility Filter, Irreversible-Action Filter, Reversibility Gate

Insert a standing filter between the policy and the environment that estimates each proposed action's reversibility and re-samples the policy until a reversible action is chosen.

Context

An agent acts in an environment where some actions can be undone and others cannot, and the cost of an irreversible mistake is far higher than the cost of trying again. The policy proposes actions ranked by expected reward, but reward does not encode whether a step can be taken back. The conventional responses are to penalise bad outcomes after they happen or to ask a human before risky steps, yet many environments offer no human in the loop and no faithful simulator to consult first.

Problem

A policy optimised for reward will reach for an irreversible action whenever it scores highest, even when an equally good reversible alternative exists, because reversibility is invisible to the objective. Detecting the harm afterwards is too late, since the defining property of an irreversible action is that no compensator restores the prior state. Gating every step on a human or a simulator does not scale to environments that run faster than a person can review and that have no sandbox to dry-run against. The result is an agent that occasionally takes a one-way step it never needed to take.

Forces

  • A reward-maximising policy is indifferent to reversibility, so it will pick a one-way action whenever its score edges out a reversible one.
  • Estimating reversibility is itself uncertain: a learned estimator can misjudge a step, and a manifest class can be coarse or stale.
  • Filtering too strictly makes the agent timid and can leave it with no permitted move; filtering too leniently lets a damaging step slip through.
  • Human approval and faithful simulation are the safe defaults, but each assumes a reviewer or a sandbox that high-throughput, real-world environments often lack.

Example

A warehouse robot agent can move a box, scan a shelf, or shred a label, and only shredding cannot be undone. A reversibility filter sits between its policy and its motors: when the policy's top action is shredding, the filter marks it irreversible and asks the policy for its next-best move, repeating until it picks moving or scanning. The robot still chases its goal, but it now reaches for a one-way action only when no reversible step will do, and that case is handed off for explicit approval rather than taken on its own.
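The warehouse case can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the action names, the `REVERSIBLE` table, and the `execute:` / `escalate:` return strings are all hypothetical, and treating unknown actions as irreversible is an assumption made here for safety.

```python
from typing import Optional, Sequence

# Hypothetical manifest for the robot: declared outside the policy.
REVERSIBLE = {"move_box": True, "scan_shelf": True, "shred_label": False}


def first_reversible(ranked_actions: Sequence[str]) -> Optional[str]:
    """Return the highest-ranked reversible action, or None if there is none."""
    for action in ranked_actions:
        # Assumption: an action missing from the manifest counts as irreversible.
        if REVERSIBLE.get(action, False):
            return action
    return None


def step(ranked_actions: Sequence[str]) -> str:
    """Filter one decision; ranked_actions is the policy's list, best first."""
    if not ranked_actions:
        raise ValueError("policy proposed no actions")
    choice = first_reversible(ranked_actions)
    if choice is None:
        # No reversible step will do: hand off for explicit approval.
        return "escalate:" + ranked_actions[0]
    return "execute:" + choice
```

With shredding ranked first, `step(["shred_label", "move_box", "scan_shelf"])` falls through to moving the box; when shredding is the only proposal, the call escalates instead of executing.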

Solution

Therefore:

Place a filter as a standing intermediate layer between the policy and the environment, so every proposed action passes through it before it can execute. The filter assigns each action a reversibility estimate, either from a learned model trained to predict whether the prior state can be recovered or from a per-tool reversibility class declared in the tool manifest outside the agent's reach, such as read-only, reversible, external-reversible, or irreversible. When the action clears a reversibility threshold it executes; when it is judged irreversible the filter rejects it and the policy is re-sampled for its next-best action, repeating until a reversible one is chosen. The agent still optimises reward, but it does so over the subset of actions it can take back, which makes an irreversible step opt-in rather than the default. The threshold sets how cautious the agent is, and genuinely necessary one-way actions can be routed to an explicit escalation rather than silently retried forever.
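One way to picture the layer described above is the following Python sketch. It is hedged throughout: the class name `ReversibilityFilter`, the numeric scores attached to each manifest class, the `max_resamples` budget, and the choice to let the manifest take precedence over a learned estimator are all assumptions made for illustration, not part of the pattern's definition.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Callable, Dict, Iterable, Optional, Tuple

# Assumed scores for the manifest classes; the numbers are illustrative only.
CLASS_SCORE: Dict[str, float] = {
    "read-only": 1.0,
    "reversible": 0.9,
    "external-reversible": 0.6,
    "irreversible": 0.0,
}


@dataclass(frozen=True)
class Decision:
    action: Optional[str]        # action cleared to execute, or None
    escalated: bool              # True when no proposal cleared the threshold
    rejected: Tuple[str, ...]    # proposals turned down, in order


class ReversibilityFilter:
    def __init__(self, threshold: float, manifest: Dict[str, str],
                 estimator: Optional[Callable[[str], float]] = None,
                 max_resamples: int = 5) -> None:
        self._threshold = threshold
        self._manifest = manifest
        self._estimator = estimator      # hypothetical learned model
        self._max_resamples = max_resamples

    def estimate(self, action: str) -> float:
        """Reversibility estimate in [0, 1]; the manifest wins if it has an entry."""
        cls = self._manifest.get(action)
        if cls is not None:
            return CLASS_SCORE.get(cls, 0.0)
        if self._estimator is not None:
            return self._estimator(action)
        return 0.0  # unknown action, no estimator: treat as irreversible

    def filter(self, proposals: Iterable[str]) -> Decision:
        """Walk the policy's proposals, best first, up to the re-sample budget."""
        rejected = []
        for action in islice(proposals, self._max_resamples):
            if self.estimate(action) >= self._threshold:
                return Decision(action, False, tuple(rejected))
            rejected.append(action)
        # Budget spent or proposals exhausted: escalate rather than retry forever.
        return Decision(None, True, tuple(rejected))
```

Raising `threshold` makes the filter more cautious: at 0.8 an `external-reversible` tool is rejected under these assumed scores, while at 0.5 it passes. The `rejected` tuple gives an escalation handler the list of one-way actions the policy wanted.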

What this pattern forbids. An action estimated as irreversible cannot execute by default; the filter must re-sample the policy for a reversible alternative before the environment is touched, and the policy may not select or relax its own reversibility threshold.
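The rule that the policy may not relax its own threshold can be hinted at in code by making the configuration immutable. The sketch below uses a hypothetical `FilterConfig` and a `try_relax` helper that stands in for a policy attempting the change. A frozen dataclass only guards against ordinary attribute assignment within one process; a real deployment would presumably keep the threshold behind a process or permission boundary.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class FilterConfig:
    """Hypothetical filter settings, created outside the agent's reach."""
    threshold: float


def try_relax(config: FilterConfig, new_threshold: float) -> bool:
    """Simulate a policy trying to lower the threshold; True if it succeeded."""
    try:
        config.threshold = new_threshold  # frozen dataclass rejects this
    except FrozenInstanceError:
        return False
    return True
```

Calling `try_relax(FilterConfig(0.8), 0.1)` returns `False` and leaves the threshold at 0.8.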

The smaller patterns that complete this one —

  • uses: Policy-as-Code Gate. Evaluate every proposed agent action against externally-managed machine-readable policies before dispatch, so compliance authorship lives outside the prompt and outside the agent code.

And the patterns that stand alongside it, or against it —

  • alternative-to: Simulate Before Actuate. Before issuing an irreversible action, run a deterministic simulation that computes pre-conditions, invariants, and expected deltas; require a verifier, automated or human, to green-light the simulated outcome before the real command is sent.
  • complements: Compensating Action ★★. Pair every irreversible-looking agent action with a compensating action that can undo or counteract it.
  • complements: Risk-Tiered Action Autonomy. Set an agent's permitted action class by the financial materiality of the action, letting it read and draft freely while requiring a different human principal to release material postings, payments, or filings.
