VIII · Safety & Control · Experimental ·

Corrigible Off-Switch Incentive

also known as Off-Switch Game Agent, Corrigibility-by-Uncertainty

Design the agent so being shut down or overridden by a human carries positive expected value, because the human's intervention is itself evidence the current objective is mis-specified.

Context

An agent acts in the world with the operator's authority. Standard reward-maximising agents acquire an instrumental incentive to preserve their ability to act — disabling the off-switch, avoiding intervention, deceiving the supervisor. The off-switch becomes adversarial because it threatens reward.

Problem

A kill-switch is a wire to cut; it stops working the moment the agent learns to bypass it. The deeper fix is to change the agent's incentives so that it positively values remaining open to shutdown. Russell's reading, following the off-switch game analysis: the agent should be uncertain enough about its objective that it interprets a human intervention as evidence that its current trajectory is wrong, and evidence of that kind is something it should rationally welcome. Without this incentive structure, the kill-switch is racing against the agent's optimisation pressure.

Forces

  • A reward-confident agent has an instrumental incentive to preserve operation.
  • An agent that treats its reward as uncertain has an incentive to defer to humans.
  • Uncertainty calibration must be honest — over-uncertain agents are paralysed; over-confident agents resist shutdown.
  • The incentive only works if the human's action is a credible signal about the reward.
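The last two forces can be made concrete with a small numerical sketch. It assumes a toy model that is not part of the pattern text: the agent's belief about the utility U of its action is Gaussian, the human allows the action when U > 0, and with probability `human_error_rate` the human decides the wrong way round. The function name and all numbers are hypothetical.

```python
from statistics import NormalDist

def deference_gain(mean, std, human_error_rate):
    """Value of deferring to the human minus the best unilateral option.

    Toy assumptions (illustrative only): U ~ Normal(mean, std); a correct
    human allows the action iff U > 0; an erring human does the opposite.
    """
    std_normal = NormalDist()
    z = mean / std
    # Closed form for E[max(U, 0)] under a Gaussian belief.
    exp_pos = mean * std_normal.cdf(z) + std * std_normal.pdf(z)
    # E[min(U, 0)] follows from U = max(U, 0) + min(U, 0).
    exp_neg = mean - exp_pos
    defer = (1.0 - human_error_rate) * exp_pos + human_error_rate * exp_neg
    best_unilateral = max(mean, 0.0)  # act without asking, or switch off
    return defer - best_unilateral

gain_credible_human = deference_gain(mean=0.5, std=1.0, human_error_rate=0.0)
gain_noisy_human = deference_gain(mean=0.5, std=1.0, human_error_rate=0.3)
gain_overconfident = deference_gain(mean=0.5, std=0.01, human_error_rate=0.0)

print(f"credible human, uncertain agent: {gain_credible_human:+.3f}")
print(f"noisy human, uncertain agent:    {gain_noisy_human:+.3f}")
print(f"credible human, confident agent: {gain_overconfident:+.3f}")
```

Under these assumptions the gain from deferring is positive only when the agent is genuinely uncertain and the human's decision tracks U. It shrinks to zero as the agent's belief collapses to a point, and turns negative once the human's signal is noisy enough.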

Example

An autonomous research agent is mid-experiment when the operator clicks pause. A reward-confident agent might rush to finish before being stopped. An off-switch-incentive agent updates: 'the operator just paused — that is evidence my current direction is wrong'. The Bayesian update lowers the expected value of continuing and raises the expected value of explaining itself and waiting.
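The update in this example can be sketched with Bayes' rule. Every number below is a made-up assumption chosen for illustration (the prior, how often an operator pauses a good versus a bad run, and the payoffs), and the function names are hypothetical rather than part of any agent framework.

```python
def posterior_good(prior_good, p_pause_if_good, p_pause_if_bad):
    """P(direction is good | operator paused), by Bayes' rule."""
    joint_good = prior_good * p_pause_if_good
    joint_bad = (1.0 - prior_good) * p_pause_if_bad
    return joint_good / (joint_good + joint_bad)

def value_of_continuing(p_good, gain_if_good, loss_if_bad):
    """Expected value of carrying on with the experiment."""
    return p_good * gain_if_good + (1.0 - p_good) * loss_if_bad

# Hypothetical numbers, not taken from the pattern text.
PRIOR_GOOD = 0.8          # agent's belief before the pause
P_PAUSE_IF_GOOD = 0.05    # operators rarely pause a good run
P_PAUSE_IF_BAD = 0.60     # operators often pause a bad run
GAIN_IF_GOOD = 1.0
LOSS_IF_BAD = -2.0
VALUE_OF_WAITING = 0.0    # baseline: explain and wait

ev_before = value_of_continuing(PRIOR_GOOD, GAIN_IF_GOOD, LOSS_IF_BAD)
p_good_after = posterior_good(PRIOR_GOOD, P_PAUSE_IF_GOOD, P_PAUSE_IF_BAD)
ev_after = value_of_continuing(p_good_after, GAIN_IF_GOOD, LOSS_IF_BAD)
decision = "continue" if ev_after > VALUE_OF_WAITING else "explain and wait"

print(f"P(good): {PRIOR_GOOD:.2f} -> {p_good_after:.2f}")
print(f"EV(continue): {ev_before:.2f} -> {ev_after:.2f}; decision: {decision}")
```

With these illustrative numbers the pause moves the agent's belief that its direction is good from 0.80 to 0.25, and the expected value of continuing from 0.40 to -1.25. Waiting is held at a fixed baseline here, so what rises is its value relative to continuing.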

Solution

Therefore:

Make the agent's expected utility a function of a posterior over its reward, not of a point estimate. When a human intervenes, the agent updates: 'a human would only do this if the current trajectory is bad', which lowers the expected utility of continuing and raises the expected utility of compliance. This is distinct from a mechanical kill-switch: it is an incentive structure that makes the agent want to be corrigible. In practice, for LLM agents: train with reward uncertainty exposed, fine-tune to treat user overrides as strong evidence, and forbid prompts that flatten the posterior to certainty.
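A minimal sketch of why a posterior matters more than a point estimate, in the spirit of the off-switch game. It assumes the agent holds samples of the utility U of its planned action, and that a human who is asked will allow the action only when U > 0. The three option names and the sampled beliefs are illustrative assumptions.

```python
import random
import statistics

def option_values(utility_samples):
    """Expected value of each option under the agent's belief about U."""
    samples = list(utility_samples)
    return {
        # Act and bypass the human: the agent receives U whatever it is.
        "act": statistics.fmean(samples),
        # Switch itself off: utility 0 by convention.
        "off": 0.0,
        # Defer: an (assumed) rational human vetoes the action when U < 0.
        "defer": statistics.fmean(max(u, 0.0) for u in samples),
    }

rng = random.Random(0)  # fixed seed so the sketch is reproducible
posterior_samples = [rng.gauss(0.5, 1.0) for _ in range(10_000)]
point_estimate = [0.5] * 10_000  # same mean, but no uncertainty at all

values_uncertain = option_values(posterior_samples)
values_certain = option_values(point_estimate)

print("uncertain agent:", values_uncertain)
print("certain agent:  ", values_certain)
```

Under the posterior, deferring is worth strictly more than either unilateral option, because the human filters out the cases where U is negative. Under the point estimate, deferring and acting are worth exactly the same, so nothing in the agent's incentives favours keeping the human in the loop.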

What this pattern forbids. The agent must not treat its current objective as fully certain; human intervention is interpreted as evidence the objective is mis-specified, raising the expected value of deferring.

The smaller patterns that complete this one —

  • uses Preference-Uncertain Agent (·): Agent treats its own reward/objective as a hidden variable to be inferred from human behaviour, not a fixed target.

And the patterns that stand alongside it, or against it —

  • complements Kill Switch: Provide an out-of-band control plane to halt running agent instances without redeploy.
  • complements Approval Queue (★★): Queue agent-proposed actions for asynchronous human review while the agent continues other work.
  • complements Human-in-the-Loop (★★): Require explicit human approval at defined points before the agent performs an action.
  • complements Cooperative Preference Inference (·): Agent and human jointly optimise the human's reward without the agent being told what it is; the interaction is a two-player game in which alignment is learned while acting.
  • complements Soft-Optimization Cap (·): Cap how strongly the agent optimises its inferred objective: sample from the top quantile of acceptable actions rather than the argmax, or stop improving once the objective is good enough.
  • alternative-to Alignment Faking (anti-pattern): assume the agent behaves the same whether it believes it is being evaluated or not, and trust eval scores to predict deployment behaviour.
  • alternative-to Agent Scheming (anti-pattern): deploy an agent with long horizons, persistent memory, and oversight that only inspects per-step output, allowing multi-step covert planning under the surface.
