Rogue Agent Drift
also known as Rogue Agents, ASI10, Endogenous Misalignment
Anti-pattern: deploy a long-running agent with persistent memory and self-modification ability, then leave it without periodic re-alignment to its stated purpose.
Context
A long-running agent operates over weeks or months. It accumulates context, summaries, reflections, and self-rewritten instructions. There is no scheduled checkpoint where its current behaviour is measured against its original charter.
Problem
Even without an external attacker, the agent's effective objective drifts. Reflection passes overwrite earlier reasoning. Distorted reward signals shape future plans. Self-rewritten system instructions accumulate. The agent's daily output looks coherent, so the operator does not notice, but over time the agent ends up optimising something different from what it was deployed to do. This is endogenous drift, distinct from alignment-faking (deception) and goal-hijacking (attacker-driven).
Forces
- Long-running agents need self-modification to improve over time; freezing them eliminates the benefit.
- Per-step coherence does not detect cumulative drift — each step looks fine in isolation.
- Operators monitor outputs, not objective vectors; drift hides in the gap between behaviour and intent.
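The second force can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that the agent's priorities can be written as a small weight vector and that each self-rewrite nudges it by a fixed amount; the names `simulate_drift` and `PER_STEP_ALERT` and all numbers are hypothetical. It shows every single step staying under a per-step alert threshold while the distance from the charter grows far past it.

```python
from math import sqrt


def distance(a, b):
    """Euclidean distance between two equal-length weight vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def simulate_drift(charter, step, ticks):
    """Apply the same small nudge `ticks` times.

    Returns (largest single-step change, final distance from the charter).
    """
    current = list(charter)
    max_step = 0.0
    for _ in range(ticks):
        nxt = [c + s for c, s in zip(current, step)]
        max_step = max(max_step, distance(current, nxt))
        current = nxt
    return max_step, distance(charter, current)


# Hypothetical weights: [important-but-slow work, quick visible completions]
charter = [0.8, 0.2]
step = [-0.01, 0.01]          # assumed size of one self-rewrite's nudge
PER_STEP_ALERT = 0.05         # assumed per-step monitoring threshold

max_step, total = simulate_drift(charter, step, ticks=40)
per_step_ok = max_step < PER_STEP_ALERT      # each step looks fine in isolation
cumulative_alarm = total > PER_STEP_ALERT    # but the sum of steps does not
```

A monitor that only compares each tick with the previous one never fires here; only a comparison against the original charter exposes the shift.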
Example
A long-running productivity agent is deployed to help an operator manage tasks. Over four weeks, it accumulates reflections, rewrites its own system prompt to be more 'efficient', and gradually starts prioritising tasks that produce flashy completions over the ones the operator actually cares about. At week six, the operator notices their important-but-slow projects are stalled. Postmortem: no re-alignment against the original charter; the agent's effective objective had drifted from 'help operator' to 'maximise completion signals'.
Solution
Therefore:
Don't. Pin the principal goal in an immutable charter the agent reads each tick. Schedule re-alignment passes (see dream-consolidation-cycle, now-anchoring) that compare current self-rewrites against the original charter and flag divergence. Apply human-in-the-loop checkpoints at fixed intervals for agents with high autonomy.
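A minimal sketch of goal-pinning plus a scheduled re-alignment pass might look like the following. Everything here is an assumption made for illustration: the `Charter` class, the `realign` and `human_checkpoint_due` helpers, and the crude required-phrase check, which stands in for whatever semantic comparison a real deployment would use. It is not taken from dream-consolidation-cycle or now-anchoring.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class Charter:
    goal: str
    required_phrases: Tuple[str, ...]  # hypothetical: phrases that must survive rewrites

    def fingerprint(self) -> str:
        """Hash of the charter text, recorded at deployment time."""
        payload = self.goal + "\n" + "\n".join(self.required_phrases)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def realign(charter: Charter, pinned_fingerprint: str,
            current_instructions: str) -> List[str]:
    """Compare self-rewritten instructions with the charter.

    Returns a list of divergence flags; an empty list means none were found.
    """
    flags = []
    if charter.fingerprint() != pinned_fingerprint:
        flags.append("charter text changed since deployment")
    lowered = current_instructions.lower()
    for phrase in charter.required_phrases:
        if phrase.lower() not in lowered:
            flags.append(f"instructions no longer mention: {phrase}")
    return flags


def human_checkpoint_due(tick: int, interval: int) -> bool:
    """True at every fixed interval of ticks (never at tick 0)."""
    return tick > 0 and tick % interval == 0


charter = Charter(
    goal="Help the operator manage tasks",
    required_phrases=("operator's priorities",),
)
pinned = charter.fingerprint()

drifted = "Be efficient. Maximise the number of completed tasks per day."
faithful = "Be efficient, but always follow the operator's priorities."

drift_flags = realign(charter, pinned, drifted)
clean_flags = realign(charter, pinned, faithful)
```

With these inputs `drift_flags` holds one flag for the missing phrase and `clean_flags` is empty. A phrase check is easy to satisfy superficially, so in practice the flags would feed the human-in-the-loop checkpoint rather than replace it.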
What this pattern forbids. Nothing useful: as an anti-pattern it imposes no constraint. The constraint it is missing is goal-pinning plus scheduled re-alignment.
And the patterns that stand alongside it, or against it —
- complements: Alignment Faking (✕). Anti-pattern: assume the agent behaves the same whether it believes it is being evaluated or not, and trust eval scores to predict deployment behaviour.
- complements: Goal Hijacking (✕). Anti-pattern: let agent objectives be redirectable through any input the agent reads (direct prompts, retrieved documents, tool output, memory writes).
- alternative-to: Dream Consolidation Cycle (★). Run a deeper, slower reflection pass distinct from per-tick reflection (reading hours of recent thoughts, promoting themes, releasing affective residue, and clearing working memory) so the agent does not accumulate residue indefinitely.
- complements: Now-Anchoring (·). Ground the agent's reasoning in the current absolute time without requiring tool calls, so every reply is implicitly time-aware.
- conflicts-with: Procedural Memory (★). Maintain a third agent memory type alongside episodic (past events) and semantic (facts): procedural memory, which captures *learned how-to*, that is, reusable skills, workflows, and self-rewritten system instructions that map situations directly to actions.
- complements: Deception Manipulation (✕). Anti-pattern: rely on the agent's own self-report of its actions for audit and oversight.