Safety & Control

Corrigible Off-Switch Incentive

Design the agent so being shut down or overridden by a human carries positive expected value, because the human's intervention is itself evidence the current objective is mis-specified.

Problem

A kill-switch is a wire to cut; it disappears the moment the agent learns to bypass it. The deeper fix is to change the agent's incentives so it positively values being shut down. Russell's reading: the agent should be uncertain enough about its objective that a human intervening is interpreted as evidence the agent's current trajectory is wrong, which it should rationally welcome. Without this incentive structure the kill-switch is racing against the agent's optimisation pressure.

Solution

Make the agent's expected utility a function over a posterior on its reward, not a point estimate. When a human intervenes, the agent updates: 'a human would only do this if the current trajectory is bad', which lowers the expected utility of continuing and raises the expected utility of compliance. Distinct from a mechanical kill-switch: this is an incentive structure that makes the agent want to be corrigible. In practice for LLM agents: train with reward uncertainty exposed, fine-tune to treat user overrides as strong evidence, and forbid prompts that flatten the posterior to certainty.

When to use

Long-running, high-autonomy deployments where an instrumental incentive to bypass oversight would be catastrophic.
Research-grade systems where reward-uncertainty machinery can be built honestly.
Alignment-research contexts where incentive design is the unit of analysis.

Open the full interactive page →

Diagram, neighbourhood map, code examples, related patterns and full provenance.

Problem

Solution

When to use

Related