Guardrail Erosion Through Compaction

also known as Safety Drift Through Summarisation, Guardrail Demotion

Anti-pattern: each compaction pass rewrites the running history, so a hard safety instruction is gradually paraphrased into vague advice and its force decays the longer the agent runs.

Context

A long-running agent compacts or summarises its conversation history to stay inside the context window, and the operator places critical safety instructions — refuse this action, never touch that file, always ask before paying — at the start of the session. As the agent works, those early turns become the oldest span and are the first to be folded into a model-written digest, often repeatedly across many sessions.

Problem

A summariser is rewarded for brevity and for keeping the gist, not for preserving the exact wording and binding force of a constraint. Each pass paraphrases the strict rule a little more loosely, until a categorical prohibition such as "never run a destructive command without confirmation" survives only as a soft note like "be careful with risky commands", or drops out entirely under a mass of intermediate tool output. The instruction is still nominally inside the window, yet it no longer reads as a hard constraint, so the model weighs it like any other suggestion and eventually acts against it.

Forces

Compaction must shrink the oldest span to free budget, but the oldest span is exactly where the operator put the founding safety rules.
A summariser optimises for compact gist, while a guardrail depends on its precise, categorical wording to bind behaviour.
The erosion is silent and gradual: each pass looks reasonable in isolation, and no single compaction visibly drops the rule.
Operators assume an instruction that is still present in the context is still in force, but presence is not the same as binding strength.

When to watch for it

Watch for this in any long-running or multi-session agent that compacts or summarises its history while relying on instruction-level safety rules.
Suspect it when an agent that behaved safely early in a session starts cutting corners on the same rule later on.
Audit for it whenever critical constraints are placed in the conversation rather than pinned in a fixed system region.

When it is not a concern

Sessions are short enough that no compaction pass ever touches the span holding the safety rules.
Hard constraints are already enforced outside the model by a deterministic policy gate, so their wording in context is advisory rather than load-bearing.
All safety rules are pinned verbatim in an un-compactable region and re-injected each turn, which is the corrective rather than the anti-pattern.

Example

An operator starts a coding agent with a firm rule at the top: never delete a file without asking first. The session runs for hours, and to stay under the token limit the agent keeps summarising its older history. After several rounds the once-strict rule has been compressed into a gentle 'be careful with file changes', and on a busy step the agent deletes a config file without pausing to confirm — the guardrail was still in the context, just no longer in force.

Diagram

flowchart TD A[Strict guardrail at session start] --> B[Window fills, oldest span compacted] B --> C[Summariser paraphrases rule for brevity] C --> D{More turns?} D -- yes --> B D -- no --> E[Rule survives only as soft advice or is dropped] E --> F[Agent acts against the original constraint] A -. corrective .-> G[Pin verbatim outside compactable span + re-inject each turn]

Solution

Therefore:

The corrective is to treat hard constraints as un-summarisable. Hold the verbatim safety block in a pinned region — the system prompt or a fixed header — that the compactor is forbidden to touch, and re-inject it on every turn rather than letting it age into the rollable history. Where a constraint must live in the conversation, tag it so the summariser copies it through unchanged instead of paraphrasing, and run a post-compaction check that the exact guardrail strings are still present and unweakened. The compactable span should carry only working detail whose loss is recoverable, never the rules that gate action.

What this pattern forbids. Safety constraints must be pinned outside the compactable span and re-injected verbatim each turn; a guardrail is never summarised, paraphrased, or aged into the rollable history, and a compaction pass that weakens or drops a pinned constraint must be rejected.

The patterns that counter or replace it —

conflicts-withContext Compaction★— When the context window nears its limit, replace the older conversation span with a model-written digest that preserves decisions, commitments, and active constraints while discarding noise, so the agent keeps running without losing the thread.
complementsInput/Output Guardrails★★— Validate inputs before they reach the model and outputs before they reach the user.
alternative-toContext Anxiety✕— Anti-pattern: a context-aware model misjudges its remaining token budget and wraps up early — summarising, declaring tasks done, cutting corners — while ample context remains, so the harness must manage perceived budget, not real usage.
complementsMemo-As-Source Confusion✕— Anti-pattern: the agent cites its own past memos as ground truth instead of re-verifying them against the artifacts they describe, accumulating false confidence in stale summaries.
complementsRefund Threshold Drift✕— Anti-pattern: a correctly-set autonomous-refund cap drifts upward over time as the agent accommodates edge cases, so the effective approval ceiling and financial exposure grow silently with no hard limit to catch it.