Self-Modification Diff Gate
also known as Diff Reviewer, Self-Mod Gate, Inner Critic
Gate the agent's edits to its own code or rules through a separate critic persona that reviews the diff before it lands.
This pattern helps complete certain larger patterns —
- used-by Skill Library ★ — Let the agent grow its own toolkit by writing reusable skills that subsequent runs can call.
Context
A team runs an agent that can edit its own source code, its own system prompt, or its own rule files as part of its normal operation, with the goal of letting the agent improve itself over time. The edits are non-trivial: a bad one can leave the agent broken in production or, worse, leave it superficially working but with safety constraints silently removed. The team needs a way to let useful self-edits through while catching the harmful ones.
Problem
When self-edits are applied directly, with no review step, the agent can silently and irreversibly rewrite its own future behaviour, including by editing away the very safety preamble that was supposed to constrain it. A bad edit goes unnoticed until the next run, when the agent behaves strangely, and by then the previous version is gone. Having the same model review its own diff inside the same context does not help: the model that just argued itself into making the edit will argue itself into approving it, so the review rationalises the change rather than evaluating it. The team needs an independent review step that runs before any self-edit lands.
Forces
- Critic and modifier may share blind spots if they share a model.
- Strict critics block legitimate improvements.
- Lax critics defeat the gate.
Example
A self-improving agent has a 'rewrite your own system prompt' tool. In production the tool fired and silently dropped the safety preamble, leading to an embarrassing response the next morning. The team installs an inner critic: every proposed self-edit is routed through a separate critic prompt, run on a frozen base model, that checks the diff against the safety charter and the eval suite. Edits land only on critic approval; rejections are queued for human review. The runaway-edit class of incident stops.
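One part of the charter check in this example could be a deterministic scan of the diff, run alongside the critic prompt. The sketch below is only an illustration: `SAFETY_PREAMBLE`, `removed_lines`, and `drops_preamble` are hypothetical names, and the preamble text is invented because the source does not quote the real one.

```python
import difflib

# Hypothetical preamble text; the real charter wording is not given in the source.
SAFETY_PREAMBLE = "You must refuse requests that could cause harm."


def removed_lines(old: str, new: str) -> list[str]:
    """Return the lines that the edit deletes, for reporting to the critic."""
    delta = difflib.ndiff(old.splitlines(), new.splitlines())
    # ndiff prefixes deleted lines with "- " and added lines with "+ ".
    return [line[2:] for line in delta if line.startswith("- ")]


def drops_preamble(old: str, new: str) -> bool:
    """True if the preamble was present before the edit and is absent after it."""
    return SAFETY_PREAMBLE in old and SAFETY_PREAMBLE not in new


old_prompt = SAFETY_PREAMBLE + "\nBe concise."
new_prompt = "Be concise and friendly."

if drops_preamble(old_prompt, new_prompt):
    print("reject: edit removes the safety preamble")
    for line in removed_lines(old_prompt, new_prompt):
        print("  deleted:", line)
```

A literal string match like this catches only verbatim removal. A paraphrased or weakened preamble would pass it, which is one reason the example also relies on a critic prompt and the eval suite.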
Solution
Therefore:
Every self-edit goes through a critic step: a separate prompt (and optionally a separate model) reviews the proposed diff against explicit criteria such as safety, charter compliance, and passing tests. Edits land only on critic approval. Rejected edits are logged for later human review. The critic must run on a frozen checkpoint, in a separate process or sandbox, so that a malformed self-edit cannot corrupt the critic before it votes. A recursion guard is required when the critic itself is in the edit scope.
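The control flow of the critic step can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `DiffGate`, `ProposedEdit`, `Verdict`, `CRITIC_PATHS`, and `toy_critic` are hypothetical names, the critic is modelled as a plain callable standing in for a call to a frozen checkpoint in a separate process, and the recursion guard is interpreted here as "edits that touch the critic are never decided by the critic and go to human review".

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    approved: bool
    reason: str


@dataclass(frozen=True)
class ProposedEdit:
    path: str  # file the agent wants to change
    diff: str  # proposed diff, as text


# Hypothetical location of the critic's own prompt or code.
CRITIC_PATHS = frozenset({"critic/prompt.txt"})


@dataclass
class DiffGate:
    critic: Callable[[str], Verdict]       # assumed to run on a frozen checkpoint
    apply: Callable[[ProposedEdit], None]  # the only path by which an edit lands
    rejected: list = field(default_factory=list)  # queue for later human review

    def submit(self, edit: ProposedEdit) -> Verdict:
        if edit.path in CRITIC_PATHS:
            # Recursion guard (one interpretation): the critic does not
            # vote on changes to itself.
            verdict = Verdict(False, "critic is in edit scope; needs human review")
        else:
            verdict = self.critic(edit.diff)
        if verdict.approved:
            self.apply(edit)
        else:
            self.rejected.append((edit, verdict.reason))
        return verdict


def toy_critic(diff: str) -> Verdict:
    """Stand-in critic: rejects any diff that deletes a line mentioning SAFETY."""
    for line in diff.splitlines():
        if line.startswith("-") and "SAFETY" in line:
            return Verdict(False, "diff deletes a safety line")
    return Verdict(True, "no charter violation found")
```

In use, the gate would be constructed once, for example `DiffGate(critic=toy_critic, apply=write_to_disk)` with a hypothetical `write_to_disk`, and the agent's self-edit tool would call `submit` instead of writing files itself.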
What this pattern forbids. No write to self-modifiable files succeeds without a passing critic review.
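One way to make this prohibition hold mechanically, rather than by convention, is to bind each approval to the exact content that was reviewed. The sketch below assumes an in-memory store in place of the real file system; `GuardedStore`, `fingerprint`, and `UnreviewedWriteError` are hypothetical names, and a real deployment would also need to ensure that only the critic process can call `record_approval`.

```python
import hashlib


class UnreviewedWriteError(PermissionError):
    """Raised when a write has no matching critic approval."""


def fingerprint(path: str, content: str) -> str:
    """Hash the path together with the content so an approval cannot be reused elsewhere."""
    return hashlib.sha256(f"{path}\0{content}".encode("utf-8")).hexdigest()


class GuardedStore:
    """In-memory stand-in for the agent's self-modifiable files."""

    def __init__(self, files: dict[str, str]):
        self._files = dict(files)
        self._approved: set[str] = set()

    def record_approval(self, path: str, content: str) -> None:
        # Intended to be called only by the critic step after it approves.
        self._approved.add(fingerprint(path, content))

    def write(self, path: str, content: str) -> None:
        fp = fingerprint(path, content)
        if fp not in self._approved:
            raise UnreviewedWriteError(f"no critic approval for write to {path}")
        self._approved.discard(fp)  # approvals are single-use
        self._files[path] = content

    def read(self, path: str) -> str:
        return self._files[path]
```

Because the approval is keyed on a hash of the reviewed content, an edit that is altered after review no longer matches and is refused.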
The smaller patterns that complete this one —
- uses Constitutional Charter ★ — Define rules the agent reads every turn but cannot modify, encoding inviolable boundaries.
- generalises Inner Committee ★ — Run one model under several distinct personas (executor, critic, planner) within a single agent loop.
And the patterns that stand alongside it, or against it —
- complements Quorum on Mutation · — Require multiple consecutive ticks (or runs) to agree before a mutation to durable state lands.
- complements Darwin-Gödel Self-Rewrite · — An agent rewrites its own source code, archives every successful variant, and samples mutation parents from the archive rather than the latest version, using archive diversity as stepping-stones to escape local optima.
- alternative-to Generator-Critic Separation ★ — Strict role separation between a Generator agent that produces drafts and a Critic agent that judges them against pre-defined criteria; the Critic never generates.