Safety & Control

Self-Edit Critic Gate

Route every proposed write or delete to the agent's own load-bearing source and identity files through a separate critic model call that can veto the edit before it lands.

Problem

A self-modifying agent that applies its own edits directly has nothing between a confidently-wrong generation and an irreversible change to the code it depends on to keep running. A model that decides to simplify a hundred-line config down to five lines, strips a critical import, or deletes a file that holds state will brick itself, and because the loop is what would otherwise notice and recover, the failure is self-erasing. Disabling self-edits entirely removes the capability, and routing every write to a human approval queue stalls routine changes on a person who is not always present.

Solution

The dispatcher's write and delete branches call a single gated-write check. A cheap deterministic pre-check runs first (syntax validity, an unambiguous-destruction backstop) and can hard-veto without any model call. For paths classified as high-stakes, a separate critic-model call sees the proposed path, a diff summary, and the agent's stated justification, and returns an approve/reason verdict; a veto blocks the write. Every classification, veto, and approval is appended to the ledger, and low-stakes paths skip the critic entirely. Because the critic is a different call from the one that proposed the edit, a single confident generation cannot both author and bless its own change.

When to use

The agent can write or delete its own source or identity files at runtime.
Some target paths are load-bearing, where a bad edit prevents the agent from restarting.
A separate, cheaper model is available to review proposed edits before they apply.

Open the full interactive page →

Diagram, neighbourhood map, code examples, related patterns and full provenance.

Problem

Solution

When to use

Related