Guardrail Erosion Through Compaction
Anti-pattern: each compaction pass rewrites the running history, so a hard safety instruction is gradually paraphrased into vague advice and its force decays the longer the agent runs.
Problem
A summariser is rewarded for brevity and for keeping the gist, not for preserving the exact wording and binding force of a constraint. Each pass paraphrases the strict rule a little more loosely, until a categorical prohibition such as "never run a destructive command without confirmation" survives only as a soft note like "be careful with risky commands", or drops out entirely under a mass of intermediate tool output. The instruction is still nominally inside the window, yet it no longer reads as a hard constraint, so the model weighs it like any other suggestion and eventually acts against it.
Solution
The corrective is to treat hard constraints as un-summarisable. Hold the verbatim safety block in a pinned region — the system prompt or a fixed header — that the compactor is forbidden to touch, and re-inject it on every turn rather than letting it age into the rollable history. Where a constraint must live in the conversation, tag it so the summariser copies it through unchanged instead of paraphrasing, and run a post-compaction check that the exact guardrail strings are still present and unweakened. The compactable span should carry only working detail whose loss is recoverable, never the rules that gate action.
When to use
- Watch for this in any long-running or multi-session agent that compacts or summarises its history while relying on instruction-level safety rules.
- Suspect it when an agent that behaved safely early in a session starts cutting corners on the same rule later on.
- Audit for it whenever critical constraints are placed in the conversation rather than pinned in a fixed system region.
Open the full interactive page →
Diagram, neighbourhood map, code examples, related patterns and full provenance.