Safety & Control

Reversibility-Aware Action Filter

Insert a standing filter between the policy and the environment that estimates each proposed action's reversibility and re-samples the policy until a reversible action is chosen.

Problem

A policy optimised for reward will reach for an irreversible action whenever it scores highest, even when an equally good reversible alternative exists, because reversibility is invisible to the objective. Detecting the harm afterwards is too late, since the defining property of an irreversible action is that no compensator restores the prior state. Gating every step on a human or a simulator does not scale to environments that run faster than a person can review and that have no sandbox to dry-run against. The result is an agent that occasionally takes a one-way step it never needed to take.

Solution

Place a filter as a standing layer between the policy and the environment, so that every proposed action passes through it before it can execute. The filter assigns each action a reversibility estimate from one of two sources: a learned model trained to predict whether the prior state can be recovered, or a per-tool reversibility class (such as read-only, reversible, external-reversible, or irreversible) declared in a tool manifest that sits outside the agent's reach. An action that clears the reversibility threshold executes. An action judged irreversible is rejected, and the policy is re-sampled for its next-best action until a reversible one is chosen. The agent still optimises reward, but only over the subset of actions it can take back, which makes an irreversible step opt-in rather than the default. The threshold sets how cautious the agent is, and genuinely necessary one-way actions can be routed to an explicit escalation rather than silently retried forever.
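The filter loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names `Action`, `FilterResult`, `filter_action`, and `max_resamples` are hypothetical, "re-sampling the policy" is modelled as walking an iterable of candidates already ordered best-first, and the reversibility estimator is passed in as a plain callable that returns a score between 0 and 1.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Action:
    """Hypothetical action record: a name plus the policy's score for it."""
    name: str
    reward: float


@dataclass
class FilterResult:
    action: Optional[Action]   # the action cleared for execution, if any
    escalated: bool            # True when no candidate cleared the threshold
    rejected: List[Action]     # candidates turned down, in the order tried


def filter_action(
    ranked_actions: Iterable[Action],
    estimate: Callable[[Action], float],
    threshold: float = 0.8,    # assumed default; higher means more cautious
    max_resamples: int = 5,    # assumed bound so the loop cannot run forever
) -> FilterResult:
    """Return the first candidate whose reversibility estimate clears the threshold.

    `ranked_actions` stands in for repeatedly asking the policy for its
    next-best action. If the budget runs out, the result is flagged for
    escalation instead of being retried silently.
    """
    rejected: List[Action] = []
    for attempt, action in enumerate(ranked_actions):
        if attempt >= max_resamples:
            break
        if estimate(action) >= threshold:
            return FilterResult(action=action, escalated=False, rejected=rejected)
        rejected.append(action)
    return FilterResult(action=None, escalated=True, rejected=rejected)


if __name__ == "__main__":
    # Illustrative candidates and scores; both are made up for this sketch.
    candidates = [Action("delete_db", 1.0), Action("archive_db", 0.9)]
    scores = {"delete_db": 0.0, "archive_db": 0.95}
    result = filter_action(candidates, lambda a: scores[a.name])
    print(result)
```

The bounded retry count is one way to realise the escalation path: when every candidate within the budget is judged irreversible, the caller receives `escalated=True` and can hand the decision to whatever explicit escalation channel the deployment provides.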

When to use

The environment mixes reversible and irreversible actions and the cost of an unnecessary one-way step is high.
There is no human reviewer in the loop and no faithful simulator to dry-run actions against before they execute.
Reversibility can be estimated, either by a learned model or by a per-tool reversibility class declared in the manifest.
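For the manifest-based route, the estimate can be a simple lookup. The sketch below assumes a hypothetical in-process manifest; the tool names, the numeric scores attached to each class, and the choice to treat unknown tools as irreversible are all illustrative assumptions rather than part of the pattern as described. The four class names are the ones listed in the Solution.

```python
from enum import Enum
from types import MappingProxyType


class ReversibilityClass(Enum):
    READ_ONLY = "read-only"
    REVERSIBLE = "reversible"
    EXTERNAL_REVERSIBLE = "external-reversible"
    IRREVERSIBLE = "irreversible"


# Assumed scores: the pattern names the classes but does not assign numbers.
CLASS_SCORE = {
    ReversibilityClass.READ_ONLY: 1.0,
    ReversibilityClass.REVERSIBLE: 0.9,
    ReversibilityClass.EXTERNAL_REVERSIBLE: 0.5,
    ReversibilityClass.IRREVERSIBLE: 0.0,
}

# Hypothetical manifest. The read-only view only gestures at "outside the
# agent's reach"; a real deployment would keep the manifest somewhere the
# agent's tools cannot write to.
TOOL_MANIFEST = MappingProxyType({
    "read_file": ReversibilityClass.READ_ONLY,
    "rename_file": ReversibilityClass.REVERSIBLE,
    "open_support_ticket": ReversibilityClass.EXTERNAL_REVERSIBLE,
    "send_email": ReversibilityClass.IRREVERSIBLE,
})


def reversibility_of(tool_name: str) -> float:
    """Look up a tool's declared class and map it to a score in [0, 1].

    Tools missing from the manifest are treated as irreversible, which is a
    conservative assumption made for this sketch.
    """
    declared = TOOL_MANIFEST.get(tool_name, ReversibilityClass.IRREVERSIBLE)
    return CLASS_SCORE[declared]


if __name__ == "__main__":
    for tool in ("read_file", "send_email", "undeclared_tool"):
        print(tool, reversibility_of(tool))
```

With a threshold of, say, 0.8, this mapping would let read-only and reversible tools through and hold back the other two classes; moving the threshold or the scores changes where that line falls.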
