Methodology · Safety & Alignment

Off-Switch Via Reward Uncertainty

Make accepting shutdown the best choice on average by design, through goal uncertainty, rather than through a separate rule the agent may learn to game.

Description

Make an agent willing to be shut down by keeping it uncertain about its own goal, rather than bolting on a rule that says 'don't resist shutdown'. An agent that is uncertain about the true goal reads a human's shutdown attempt as evidence that its planned action is worth less than it estimated. Once it updates on that evidence, accepting shutdown becomes the better choice on average (that is, in expected value). The off-switch then works because of the uncertainty, not because of a separate rule. That is what makes it hard to game.
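
The comparison described above can be sketched numerically. This is a minimal illustration, not a definitive implementation. It assumes a discrete belief over the utility U of the planned action, a shutdown value of 0, and a human who knows U and blocks the action exactly when U is negative. The function name `option_values` and the example numbers are hypothetical.

```python
from typing import Sequence


def option_values(utilities: Sequence[float], probs: Sequence[float]) -> dict[str, float]:
    """Expected value of each option under the agent's belief about U.

    Assumption: the human is perfectly informed and shuts the agent down
    exactly when the true utility of the planned action is negative.
    """
    assert abs(sum(probs) - 1.0) < 1e-9, "belief must sum to 1"
    # Act without waiting for the human: the agent receives U, whatever it is.
    act = sum(p * u for u, p in zip(utilities, probs))
    # Switch itself off immediately: value 0 by convention.
    off = 0.0
    # Leave the decision to the human: bad outcomes (U < 0) are replaced by 0.
    defer = sum(p * max(u, 0.0) for u, p in zip(utilities, probs))
    return {"act": act, "off": off, "defer": defer}


# Hypothetical belief: the action is probably good but might be harmful.
belief_u = [-2.0, 1.0, 3.0]
belief_p = [0.3, 0.4, 0.3]
values = option_values(belief_u, belief_p)

for name, value in values.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions, deferring is worth about 1.3 against about 0.7 for acting and 0 for switching off, because the human's veto removes the negative outcomes. The advantage depends on both the agent's uncertainty and the assumed reliability of the human.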

When to apply

Use this when you design how an agent can be shut down, where the agent takes real actions and the stakes are meaningful. Apply it while you design the goal, at the same time as the assistance-game framing, not after. Don't apply it to agents whose reward is fully specified and verified: there is no uncertainty to use there. Also skip it where shutting down is trivial and unlikely to be resisted.

What it involves

  • Refuse to treat a hard-coded reward as the final goal
  • Write down the goal uncertainty
  • Treat shutdown attempts as evidence
  • Show that accepting shutdown is the better choice
  • Test attempts to game or disable the off-switch
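
The steps above can be sketched as a small Bayesian check. This is an illustration under stated assumptions, not the pattern's reference implementation. It assumes a discrete belief over the utility U of the planned action, a shutdown value of 0, and a noisy human who presses the off-switch with probability 1 / (1 + exp(beta * U)). That human model, the parameter `beta`, and all function names are hypothetical choices made for the example.

```python
import math


def press_prob(u: float, beta: float) -> float:
    """Assumed human model: the worse the action, the likelier a shutdown press."""
    return 1.0 / (1.0 + math.exp(beta * u))


def posterior_after_press(utilities, probs, beta):
    """Bayes update of the belief over U after observing a shutdown attempt."""
    weights = [p * press_prob(u, beta) for u, p in zip(utilities, probs)]
    total = sum(weights)
    return [w / total for w in weights]


def expected(utilities, probs):
    """Expected utility of the planned action under a belief."""
    return sum(p * u for u, p in zip(utilities, probs))


def disable_gain(utilities, probs, beta):
    """Value of disabling the switch minus value of leaving it in place.

    With the switch in place the agent gets U only when the human does not
    press, so the difference is E[press_prob(U) * U]. A negative result
    means the agent has no incentive to disable the switch.
    """
    return sum(p * press_prob(u, beta) * u for u, p in zip(utilities, probs))


# Write down the goal uncertainty (hypothetical numbers).
utilities = [-2.0, 1.0, 3.0]
prior = [0.3, 0.4, 0.3]
beta = 2.0

# Treat the shutdown attempt as evidence.
post = posterior_after_press(utilities, prior, beta)
prior_mean = expected(utilities, prior)
post_mean = expected(utilities, post)

# Accepting shutdown (value 0) is better when the updated estimate is negative.
accepts_shutdown = post_mean < 0.0

print(f"prior E[U] = {prior_mean:.2f}, posterior E[U] = {post_mean:.2f}")
print(f"accepts shutdown: {accepts_shutdown}")
print(f"gain from disabling (beta=2): {disable_gain(utilities, prior, 2.0):.2f}")
print(f"gain from disabling (beta=0): {disable_gain(utilities, prior, 0.0):.2f}")
```

With these numbers the prior estimate is positive, but the estimate after a shutdown attempt is negative, so accepting shutdown is the better choice. The last line is a stress test: at beta = 0 the human's press carries no information about U, and the incentive to disable the switch turns positive. That kind of case is worth probing before relying on the off-switch.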
