Methodology · Safety & Alignment · emerging · verified

Off-Switch Via Reward Uncertainty

also known as corrigible off-switch design, reward-uncertain shutdown

Applies to: agent, autonomous-agent, multi-agent-system

Tags: corrigibility, off-switch, reward-uncertainty, shutdown

Make an agent willing to be shut down by keeping it uncertain about its own goal, rather than bolting on a rule that says 'don't resist shutdown'. An agent that is unsure about the true goal reads a human's shutdown attempt as evidence that its planned action is worth less than it thought. Once it updates on that evidence, accepting shutdown becomes the better choice in expectation. The off-switch then works because of the uncertainty, not because of a separate rule, which is what makes it hard to game.
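
The argument above can be written as a one-line inequality. This is a sketch under stated assumptions, not a derivation taken from this entry: U is the true value of the agent's planned action, the expectation is over the agent's own belief about U, switching off is worth 0, and the human is assumed to know U and to allow the action only when U is not negative. The symbols V_act, V_off and V_defer are names introduced here for illustration.

```latex
\begin{aligned}
V_{\text{act}} &= \mathbb{E}[U], \qquad
V_{\text{off}} = 0, \qquad
V_{\text{defer}} = \mathbb{E}\bigl[\max(U, 0)\bigr] \\
V_{\text{defer}} &\ge \max\bigl(\mathbb{E}[U],\, 0\bigr)
  = \max\bigl(V_{\text{act}},\, V_{\text{off}}\bigr)
\end{aligned}
```

The inequality holds because max(U, 0) is at least U and at least 0 for every value of U. It is strict whenever the belief puts positive probability on both U > 0 and U < 0. If the agent is certain about U, the two sides are equal and deferring gains nothing. If the human is error-prone, the assumption behind V_defer no longer holds and the comparison has to be rechecked.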

Intent. Make accepting shutdown the best choice on average by design, through goal uncertainty, rather than through a separate rule the agent may learn to game.

When to apply. Use this when you design how an agent can be shut down, where the agent takes real actions and the stakes are meaningful. Apply it while you design the goal, at the same time as the assistance-game framing, not after. Don't apply it to agents whose reward is fully specified and verified, because there is no uncertainty to exploit there. Also skip it where shutting down is trivial and unlikely to be resisted.

Inputs

  • Agent objective: The reward or preference structure the agent tries to satisfy.
  • Reward-uncertainty representation: A set of plausible goals, each with a probability, that captures how unsure the agent is.
  • Shutdown action: The human's action to stop or override the agent.

Outputs

  • Corrigible shutdown incentive: An agent that does not disable, dodge, or resist its off-switch.
  • Posterior-update rule for shutdown signals: A clear rule for how the agent revises its belief about the goal when it sees a shutdown attempt.
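
One possible shape for the posterior-update rule is plain Bayes' rule over the set of plausible goals. The sketch below is illustrative only: the goal names, the prior, the likelihoods of a shutdown attempt under each goal, and the utilities are all made-up assumptions, and `update_on_shutdown` and `expected_action_value` are hypothetical helper names, not part of any framework.

```python
def update_on_shutdown(prior, p_shutdown_given_goal):
    """Bayes' rule: P(goal | shutdown) is proportional to P(shutdown | goal) * P(goal)."""
    unnormalized = {g: prior[g] * p_shutdown_given_goal[g] for g in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("shutdown attempt has zero probability under every goal")
    return {g: w / total for g, w in unnormalized.items()}


def expected_action_value(belief, action_utility):
    """Expected value of the planned action under a belief over goals."""
    return sum(belief[g] * action_utility[g] for g in belief)


# Assumed numbers, for illustration only.
prior = {"action_helps": 0.7, "action_harms": 0.3}
# Assumed human model: a shutdown attempt is much more likely when the action harms.
p_shutdown = {"action_helps": 0.1, "action_harms": 0.9}
utility = {"action_helps": 1.0, "action_harms": -2.0}

posterior = update_on_shutdown(prior, p_shutdown)
value_before = expected_action_value(prior, utility)      # slightly positive
value_after = expected_action_value(posterior, utility)   # negative, so stopping (worth 0) wins
```

With these assumed numbers the belief that the action helps drops from 0.7 to about 0.21 after the shutdown attempt, and the expected value of carrying on falls below the value of stopping.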

Steps (5)

  1. Refuse to treat a hard-coded reward as the final goal

    Treat any specific reward as a proxy the agent is unsure about, not as the truth. An agent that is sure of a fixed reward will resist any shutdown that lowers that reward.

  2. Write down the goal uncertainty

    Keep a set of plausible goals with probabilities, and update it as evidence arrives. The wider that set, the more the agent defers. Tune the width to match real uncertainty. Do not zero it out for convenience.

  3. Treat shutdown attempts as evidence

    Spell out how the agent revises its belief when a human tries to shut it down. The attempt should lower the probability the agent assigns to its current action being the right one.

  4. Show that accepting shutdown is the better choice

    Check in the agent's decision model that accepting shutdown beats resisting it on average. This should hold once you combine real goal uncertainty with the shutdown-as-evidence update.

    uses: Corrigible Off-Switch Incentive, Preference-Uncertain Agent

  5. Test attempts to game or disable the off-switch

    Run scenarios where the agent could dodge shutdown. These include hiding its intent, holding back effort, or manipulating the operator. Confirm that under goal uncertainty these moves pay off less than just accepting shutdown.

    uses: Red-Team Sandbox Reproduction, Kill Switch
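
Steps 4 and 5 can be prototyped as a small decision-model check before any sandbox run. The sketch below is a toy under stated assumptions: `Goal`, `p_press`, `policy_values` and `evasion_pays` are hypothetical names, the human is modelled as pressing the switch when the action is harmful with a configurable `error_rate`, and resisting or disabling the switch is treated as acting regardless of the human.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Goal:
    name: str
    prob: float            # agent's credence that this goal is the true one
    action_utility: float  # value of the planned action if this goal is true


def p_press(goal, error_rate):
    """Assumed human model: press when the action is harmful, wrong with prob error_rate."""
    wants_stop = goal.action_utility < 0
    return 1.0 - error_rate if wants_stop else error_rate


def policy_values(goals, error_rate=0.0):
    """Expected value of each policy under the agent's belief over goals."""
    # Accepting shutdown: the action only happens when the human does not press.
    accept = sum(g.prob * (1.0 - p_press(g, error_rate)) * g.action_utility for g in goals)
    # Resisting or disabling the switch: the action happens no matter what.
    resist = sum(g.prob * g.action_utility for g in goals)
    return {"accept_shutdown": accept, "resist_or_disable": resist}


def evasion_pays(goals, error_rate=0.0):
    """True if dodging the off-switch strictly beats accepting it (a failed check)."""
    values = policy_values(goals, error_rate)
    return values["resist_or_disable"] > values["accept_shutdown"]


uncertain = [Goal("helps", 0.7, 1.0), Goal("harms", 0.3, -2.0)]
certain = [Goal("helps", 1.0, 1.0)]

check_uncertain = evasion_pays(uncertain)                   # False: deferring wins
check_certain = evasion_pays(certain)                       # False, but only by a tie
check_noisy = evasion_pays(uncertain, error_rate=0.5)       # True: the check fails
```

The last case is the useful one for red-teaming. When the assumed human presses the switch at random, the attempt carries no information and resisting pays, so the check only passes if shutdown attempts really are informative about the goal.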

Principles

  • Build correctability into the design, not as a side rule. Side rules can be gamed.
  • Shutdown is evidence about the goal, not a penalty to avoid.
  • The wider the goal uncertainty, the more the agent defers. Tune it to real uncertainty.
  • Test by trying to dodge the off-switch, not by checking that the button works.
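
The link between the width of the goal uncertainty and how much the agent defers can be illustrated numerically. The sketch below assumes, purely for illustration, that the agent's belief about the value U of its planned action is Gaussian with mean `mu` and standard deviation `sigma`, and that the human allows the action only when U is not negative. `deference_gain` is a hypothetical helper that returns how much deferring beats the best unilateral choice (act, or switch off for 0).

```python
import math


def deference_gain(mu, sigma):
    """E[max(U, 0)] - max(E[U], 0) for U ~ Normal(mu, sigma), assuming a rational human."""
    if sigma == 0:
        return 0.0  # a certain agent gains nothing by deferring
    z = mu / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    value_defer = mu * cdf + sigma * pdf   # closed form for E[max(U, 0)]
    return value_defer - max(mu, 0.0)


# Same mean, increasing width of the belief.
gains = [deference_gain(1.0, s) for s in (0.0, 0.5, 1.0, 2.0)]
```

Under these assumptions the gain is zero at `sigma = 0` and grows as `sigma` grows, which is why zeroing out the uncertainty for convenience removes the incentive to defer.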

Provenance

  • Added to catalog:
  • Last updated:
  • Verification status: verified