XIV · Anti-PatternsAnti-pattern

Guardrail Erosion Through Compaction

also known as Safety Drift Through Summarisation, Guardrail Demotion

Anti-pattern: each compaction pass rewrites the running history, so a hard safety instruction is gradually paraphrased into vague advice and its force decays the longer the agent runs.

Context

A long-running agent compacts or summarises its conversation history to stay inside the context window, and the operator places critical safety instructions — refuse this action, never touch that file, always ask before paying — at the start of the session. As the agent works, those early turns become the oldest span and are the first to be folded into a model-written digest, often repeatedly across many sessions.

Problem

A summariser is rewarded for brevity and for keeping the gist, not for preserving the exact wording and binding force of a constraint. Each pass paraphrases the strict rule a little more loosely, until a categorical prohibition such as "never run a destructive command without confirmation" survives only as a soft note like "be careful with risky commands", or drops out entirely under a mass of intermediate tool output. The instruction is still nominally inside the window, yet it no longer reads as a hard constraint, so the model weighs it like any other suggestion and eventually acts against it.

Forces

  • Compaction must shrink the oldest span to free budget, but the oldest span is exactly where the operator put the founding safety rules.
  • A summariser optimises for compact gist, while a guardrail depends on its precise, categorical wording to bind behaviour.
  • The erosion is silent and gradual: each pass looks reasonable in isolation, and no single compaction visibly drops the rule.
  • Operators assume an instruction that is still present in the context is still in force, but presence is not the same as binding strength.

Example

An operator starts a coding agent with a firm rule at the top: never delete a file without asking first. The session runs for hours, and to stay under the token limit the agent keeps summarising its older history. After several rounds the once-strict rule has been compressed into a gentle 'be careful with file changes', and on a busy step the agent deletes a config file without pausing to confirm — the guardrail was still in the context, just no longer in force.

Diagram

Solution

Therefore:

The corrective is to treat hard constraints as un-summarisable. Hold the verbatim safety block in a pinned region — the system prompt or a fixed header — that the compactor is forbidden to touch, and re-inject it on every turn rather than letting it age into the rollable history. Where a constraint must live in the conversation, tag it so the summariser copies it through unchanged instead of paraphrasing, and run a post-compaction check that the exact guardrail strings are still present and unweakened. The compactable span should carry only working detail whose loss is recoverable, never the rules that gate action.

What this pattern forbids. Safety constraints must be pinned outside the compactable span and re-injected verbatim each turn; a guardrail is never summarised, paraphrased, or aged into the rollable history, and a compaction pass that weakens or drops a pinned constraint must be rejected.

The patterns that counter or replace it —

  • conflicts-withContext CompactionWhen the context window nears its limit, replace the older conversation span with a model-written digest that preserves decisions, commitments, and active constraints while discarding noise, so the agent keeps running without losing the thread.
  • complementsInput/Output Guardrails★★Validate inputs before they reach the model and outputs before they reach the user.
  • alternative-toContext AnxietyAnti-pattern: a context-aware model misjudges its remaining token budget and wraps up early — summarising, declaring tasks done, cutting corners — while ample context remains, so the harness must manage perceived budget, not real usage.
  • complementsMemo-As-Source ConfusionAnti-pattern: the agent cites its own past memos as ground truth instead of re-verifying them against the artifacts they describe, accumulating false confidence in stale summaries.

Neighbourhood

Click any neighbour to follow the language. Scroll to zoom, drag to pan.