VIII · Safety & ControlExperimental·

Self-Edit Critic Gate

also known as Two-Phase Write Critic, Self-Modification Veto Gate

Route every proposed write or delete to the agent's own load-bearing source and identity files through a separate critic model call that can veto the edit before it lands.

This pattern helps complete certain larger patterns —

  • specialisesSelf-Modification Diff Gate·Gate the agent's edits to its own code or rules through a separate critic persona that reviews the diff before it lands.

Context

A long-running agent is allowed to edit files inside its own instance, including the source code and identity files it runs on. Most edits are routine, but a few touch load-bearing files (the main config, the dispatcher, the personality or charter) where a single careless rewrite can leave the next restart unable to boot. No human reviews each write in real time.

Problem

A self-modifying agent that applies its own edits directly has nothing between a confidently-wrong generation and an irreversible change to the code it depends on to keep running. A model that decides to simplify a hundred-line config down to five lines, strips a critical import, or deletes a file that holds state will brick itself, and because the loop is what would otherwise notice and recover, the failure is self-erasing. Disabling self-edits entirely removes the capability, and routing every write to a human approval queue stalls routine changes on a person who is not always present.

Forces

  • A critic call costs a model round-trip on every high-stakes write; routine low-stakes edits should skip it.
  • The critic must fail safe: if the critic call itself errors, the chosen policy (fail-open vs fail-closed) decides whether a hard non-model backstop is still required.
  • Over-strict vetoes block legitimate refactors; over-loose vetoes defeat the gate.

Example

On a quiet tick the agent decides its main config file is too long and proposes replacing a hundred-line module with a five-line stub. The write is routed through the gated check: the path is flagged high-stakes, so a separate critic call reads the diff and the justification, sees a wholesale rewrite that removes top-level definitions with no replacement, and returns approve:false. The write is blocked and a critic-refuse entry lands in the ledger. The agent keeps running on the working config instead of bricking on the next restart, which is exactly the failure that motivated the gate after an earlier incident stripped that same file.

Diagram

Solution

Therefore:

The dispatcher's write and delete branches call a single gated-write check. A cheap deterministic pre-check runs first (syntax validity, an unambiguous-destruction backstop) and can hard-veto without any model call. For paths classified as high-stakes, a separate critic-model call sees the proposed path, a diff summary, and the agent's stated justification, and returns an approve/reason verdict; a veto blocks the write. Every classification, veto, and approval is appended to the ledger, and low-stakes paths skip the critic entirely. Because the critic is a different call from the one that proposed the edit, a single confident generation cannot both author and bless its own change.

What this pattern forbids. A write or delete to a high-stakes path cannot be applied until a separate critic call approves it; the proposing call cannot bless its own edit.

And the patterns that stand alongside it, or against it —

  • complementsDarwin-Gödel Self-Rewrite·An agent rewrites its own source code, archives every successful variant, and samples mutation parents from the archive rather than the latest version, using archive diversity as stepping-stones to escape local optima.
  • alternative-toApproval Queue★★Queue agent-proposed actions for asynchronous human review while the agent continues other work.
  • complementsQuorum on Mutation·Require multiple consecutive ticks (or runs) to agree before a mutation to durable state lands.

Neighbourhood

Click any neighbour to follow the language. Scroll to zoom, drag to pan.