Verification & Reflection

Self-Modification Diff Gate

Gate the agent's edits to its own code or rules through a separate critic persona that reviews the diff before it lands.

Problem

When self-edits are applied directly, with no review step, the agent can silently rewrite its own future behaviour in ways that cannot be undone, including edits that get around the very safety preamble that was supposed to constrain it. A bad edit goes unnoticed until the next run, when the agent behaves strangely, and by then the previous version is gone. Having the same model review its own diff inside the same context tends to produce rationalisation rather than evaluation, because the model that just argued itself into making the edit will argue itself into approving it. The team needs an independent review step that runs before any self-edit lands.

Solution

Every self-edit goes through a critic step: a separate prompt (and optionally a separate model) reviews the proposed diff against explicit criteria (safety, charter compliance, passing tests). Edits land only on critic approval. Rejected edits are logged for later human review. The critic must run from a frozen checkpoint, in a separate process or sandbox, so that a malformed self-edit cannot corrupt the critic before it votes. A recursion guard is required when the critic itself is in the edit scope.
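
The gate described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `SelfEdit`, `Verdict`, `gate`, `toy_critic` and `CRITIC_PATHS` are all hypothetical, the toy critic stands in for a real critic prompt or model, and escalating to a human is only one possible way to implement the recursion guard.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SelfEdit:
    path: str  # file the agent wants to change
    diff: str  # proposed change as unified-diff text


@dataclass
class Verdict:
    approved: bool
    reasons: List[str] = field(default_factory=list)


# Hypothetical: the files that make up the critic itself.
CRITIC_PATHS = {"critic/prompt.txt", "critic/criteria.txt"}


def gate(
    edit: SelfEdit,
    critic: Callable[[SelfEdit], Verdict],
    apply_edit: Callable[[SelfEdit], None],
    rejected_log: List[Tuple[SelfEdit, Verdict]],
) -> Verdict:
    """Apply the edit only if the critic approves; log it otherwise."""
    if edit.path in CRITIC_PATHS:
        # Recursion guard (assumed policy): the critic never votes on
        # edits to itself, so these are rejected and left for a human.
        verdict = Verdict(False, ["edit touches the critic; needs human review"])
    else:
        verdict = critic(edit)

    if verdict.approved:
        apply_edit(edit)
    else:
        rejected_log.append((edit, verdict))
    return verdict


def toy_critic(edit: SelfEdit) -> Verdict:
    """Stand-in critic: reject any diff that removes a line mentioning SAFETY."""
    removed = [
        line for line in edit.diff.splitlines()
        if line.startswith("-") and not line.startswith("---")
    ]
    reasons = [f"removes safety line: {line[1:].strip()}"
               for line in removed if "SAFETY" in line]
    return Verdict(approved=not reasons, reasons=reasons)
```

Calling `gate(edit, toy_critic, applied.append, log)` with two plain lists shows the behaviour: approved edits end up in `applied`, while rejected edits and edits to the critic's own files end up in `log` for later human review.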

When to use

  • The agent edits its own code, prompts, or rules and bad edits would be hard to reverse.
  • A separate critic prompt or model can review proposed diffs against explicit criteria.
  • The critic can run on a frozen checkpoint, isolated from the edit scope.
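
One way to approximate the isolation condition in the last bullet is to run a snapshot of the critic in a child process. The sketch below is an assumption-laden illustration: `run_frozen_critic` is a hypothetical helper, the critic is assumed to be a standalone script that reads a diff on stdin and prints a JSON verdict such as `{"approved": true, "reasons": []}`, and a real deployment would more likely pin the snapshot to a known-good commit or a sandbox image than copy the file at review time.

```python
import json
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_frozen_critic(critic_script: Path, diff: str, timeout: float = 30.0) -> dict:
    """Copy the critic to a scratch directory and run the copy in a child process.

    Any failure (crash, timeout, malformed output) counts as a rejection,
    so a broken critic can never approve an edit by accident.
    """
    with tempfile.TemporaryDirectory() as scratch:
        frozen = Path(scratch) / "critic_frozen.py"
        shutil.copy2(critic_script, frozen)  # snapshot taken before the edit lands
        try:
            proc = subprocess.run(
                [sys.executable, str(frozen)],
                input=diff,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            if proc.returncode != 0:
                raise ValueError(f"critic exited with code {proc.returncode}")
            verdict = json.loads(proc.stdout)
            return {
                # Only a literal JSON true counts as approval.
                "approved": verdict["approved"] is True,
                "reasons": list(verdict.get("reasons", [])),
            }
        except (subprocess.TimeoutExpired, ValueError, KeyError, TypeError) as exc:
            # Fail closed: treat every error as a rejection.
            return {"approved": False, "reasons": [f"critic failed: {exc}"]}
```

Failing closed is a deliberate choice in this sketch: if the critic cannot produce a well-formed verdict, the edit does not land and the failure reason is available for the rejection log.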
