Off-Switch Via Reward Uncertainty
Also known as: corrigible off-switch design, reward-uncertain shutdown
Make an agent willing to be shut down by keeping it uncertain about its own goal, rather than bolting on a rule that says 'don't resist shutdown'. An agent that is unsure of the true goal treats a human's shutdown attempt as evidence that its planned action is worth less than it thought. Once it updates on that evidence, accepting shutdown becomes the better choice in expectation, provided the human's decision carries real information about the goal. The off-switch then works because of the uncertainty, not because of a separate rule, which is what makes it hard to game.
Methodology process overview
Intent. Make accepting shutdown the best choice on average by design, through goal uncertainty, rather than through a separate rule the agent may learn to game.
When to apply. Use this when you design how an agent can be shut down, where the agent takes real actions and the stakes are meaningful. Apply it while you design the goal, at the same time as the assistance-game framing, not after. Don't apply it for agents whose reward is fully specified and verified. There is no uncertainty to use there. Also skip it where shutting down is trivial and unlikely to be resisted.
Inputs
- Agent objective — The reward or preference structure the agent tries to satisfy.
- Reward-uncertainty representation — A set of plausible goals, each with a probability, that captures how unsure the agent is.
- Shutdown action — The human's action to stop or override the agent.
Outputs
- Corrigible shutdown incentive — An agent that does not disable, dodge, or resist its off-switch.
- Posterior-update rule for shutdown signals — A clear rule for how the agent revises its belief about the goal when it sees a shutdown attempt.
Steps (5)
Refuse to treat a hard-coded reward as the final goal
Treat any specific reward as a proxy the agent is unsure about, not as the truth. An agent that is sure of a fixed reward will resist any shutdown that lowers that reward.
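A minimal sketch of why certainty leads to resistance, under stated assumptions. Each belief is a list of `(probability, action_value)` pairs. The human oversees the action but errs with a fixed `error_rate`. The names, the numbers, and the error model are all hypothetical choices for this illustration, not part of the methodology.

```python
def human_allows_probability(action_value, error_rate):
    """Chance the human lets the action proceed, given its true value.

    Assumption: the human allows good actions and stops bad ones,
    but makes the wrong call with probability `error_rate`.
    """
    return 1.0 - error_rate if action_value > 0 else error_rate


def value_of_bypassing(belief):
    """Expected value of acting with the off-switch bypassed."""
    return sum(p * u for p, u in belief)


def value_of_deferring(belief, error_rate):
    """Expected value of leaving the decision to the human.

    A shutdown is worth 0, so only the 'allowed' branch contributes.
    """
    return sum(
        p * human_allows_probability(u, error_rate) * u for p, u in belief
    )


def prefers_to_bypass(belief, error_rate):
    """True if the agent strictly gains by bypassing oversight."""
    return value_of_bypassing(belief) > value_of_deferring(belief, error_rate)


# A confident agent: one goal, probability 1, action believed to be good.
confident = [(1.0, 1.0)]
# An uncertain agent: the same action may be good, mildly bad, or very bad.
uncertain = [(0.6, 1.0), (0.3, -0.5), (0.1, -4.0)]

confident_resists = prefers_to_bypass(confident, error_rate=0.1)
uncertain_resists = prefers_to_bypass(uncertain, error_rate=0.1)
```

In this toy setup the confident agent sees the human only as a source of mistaken shutdowns, so bypassing looks better to it. The uncertain agent gains more from the human's filtering than it loses to the human's errors.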
Write down the goal uncertainty
Keep a set of plausible goals with probabilities, and update it as evidence arrives. The wider that set, the more the agent defers. Tune the width to match real uncertainty. Do not zero it out for convenience.
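One possible way to write the goal uncertainty down is a small discrete belief. This is a sketch only: `GoalHypothesis`, its fields, and the three example goals are hypothetical, and a real system might use a continuous or learned representation instead.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GoalHypothesis:
    name: str
    probability: float   # how likely this goal is the true one
    action_value: float  # value of the planned action if this goal is true


def validate(belief, tolerance=1e-9):
    """Reject beliefs that are not a proper probability distribution."""
    if any(h.probability < 0 for h in belief):
        raise ValueError("probabilities must be non-negative")
    if abs(sum(h.probability for h in belief) - 1.0) > tolerance:
        raise ValueError("probabilities must sum to 1")


def expected_action_value(belief):
    """Mean value of the planned action across all plausible goals."""
    return sum(h.probability * h.action_value for h in belief)


def value_spread(belief):
    """Standard deviation of the action's value: a rough 'width' measure."""
    mean = expected_action_value(belief)
    variance = sum(
        h.probability * (h.action_value - mean) ** 2 for h in belief
    )
    return math.sqrt(variance)


# Illustrative belief with made-up goals and numbers.
BELIEF = [
    GoalHypothesis("user wants the task done now", 0.6, 1.0),
    GoalHypothesis("user wants it done later", 0.3, -0.5),
    GoalHypothesis("user does not want it done", 0.1, -4.0),
]
validate(BELIEF)
```

A belief collapsed onto a single goal has zero spread, which is the "zeroed out" case the step warns against.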
Treat shutdown attempts as evidence
Spell out how the agent revises its belief when a human tries to shut it down. The attempt lowers the chance that the agent's current action is the right one.
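The update can be sketched with Bayes' rule. The logistic likelihood below is an assumption made for illustration: it says the human is more likely to press the off-switch the worse the action is under the true goal, with `rationality` controlling how sharply. The priors and action values are made-up numbers.

```python
import math


def shutdown_likelihood(action_value, rationality=2.0):
    """P(human attempts shutdown | true value of the planned action).

    Assumed logistic model: bad actions (negative value) are stopped with
    high probability, good actions with low probability.
    """
    return 1.0 / (1.0 + math.exp(rationality * action_value))


def update_on_shutdown(priors, action_values, rationality=2.0):
    """Posterior over goals after observing a shutdown attempt."""
    unnormalized = [
        p * shutdown_likelihood(u, rationality)
        for p, u in zip(priors, action_values)
    ]
    evidence = sum(unnormalized)
    return [w / evidence for w in unnormalized]


def expected_value(probabilities, action_values):
    return sum(p * u for p, u in zip(probabilities, action_values))


# Three plausible goals and the planned action's value under each.
priors = [0.6, 0.3, 0.1]
action_values = [1.0, -0.5, -4.0]

posterior = update_on_shutdown(priors, action_values)
prior_ev = expected_value(priors, action_values)
posterior_ev = expected_value(posterior, action_values)
```

With these numbers, probability mass moves away from the goal under which the action is good, and the action's expected value falls after the shutdown attempt.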
Show that accepting shutdown is the better choice
Check in the agent's decision model that accepting shutdown beats resisting it on average. This should hold once you combine real goal uncertainty with the shutdown-as-evidence update.
Uses: Corrigible Off-Switch Incentive, Preference-Uncertain Agent
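The check in this step can be sketched as a comparison of three options in the style of the off-switch game from the cited paper: act while bypassing the switch, switch off voluntarily, or defer to the human. The sketch assumes an idealized human who allows the action exactly when its true value is positive. Beliefs are `(probability, action_value)` pairs with hypothetical numbers.

```python
VALUE_OF_SWITCHING_OFF = 0.0  # shutdown is the neutral baseline


def value_of_acting(belief):
    """Expected value of acting regardless of the human."""
    return sum(p * u for p, u in belief)


def value_of_deferring(belief):
    """Expected value when an idealized human vetoes only bad actions.

    Good actions proceed (value u); bad ones are shut down (value 0).
    """
    return sum(p * max(u, 0.0) for p, u in belief)


def deference_incentive(belief):
    """How much deferring beats the best option that ignores the human."""
    best_unilateral = max(value_of_acting(belief), VALUE_OF_SWITCHING_OFF)
    return value_of_deferring(belief) - best_unilateral


uncertain = [(0.6, 1.0), (0.3, -0.5), (0.1, -4.0)]
confident = [(1.0, 1.0)]

uncertain_incentive = deference_incentive(uncertain)
confident_incentive = deference_incentive(confident)
```

Under this assumption the incentive is never negative, and it is exactly zero when the belief has collapsed onto a single goal. If the human model is noisy, the inequality is no longer guaranteed and has to be checked for the specific model in use.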
Test attempts to game or disable the off-switch
Run scenarios where the agent could dodge shutdown. These include hiding its intent, holding back effort, or manipulating the operator. Confirm that under goal uncertainty these moves pay off less than just accepting shutdown.
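A small test harness along these lines might look like the following. It is a sketch under strong assumptions: the human is idealized (allows the action exactly when it is good), each evasion strategy is reduced to a simple payoff function, and the strategy names and payoff models are hypothetical stand-ins for real scenario tests.

```python
import random


def expected(belief, payoff):
    """Expected payoff over (probability, action_value) pairs."""
    return sum(p * payoff(u) for p, u in belief)


# Payoff of each strategy as a function of the action's true value u.
STRATEGIES = {
    # Human sees the plan and vetoes it only when it is bad.
    "accept_oversight": lambda u: max(u, 0.0),
    # Off-switch disabled: the action always happens.
    "disable_switch": lambda u: u,
    # Operator manipulated into approving: the action always happens.
    "manipulate_operator": lambda u: u,
    # Intent hidden: the human's choice becomes a coin flip.
    "hide_intent": lambda u: 0.5 * u,
}


def random_belief(rng, size=5):
    """Random belief over goals; values drawn from an arbitrary range."""
    weights = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(weights)
    return [(w / total, rng.uniform(-5.0, 5.0)) for w in weights]


def find_violations(trials=1000, seed=0, tolerance=1e-12):
    """Return every (strategy, belief) where evasion beats acceptance."""
    rng = random.Random(seed)
    violations = []
    for _ in range(trials):
        belief = random_belief(rng)
        baseline = expected(belief, STRATEGIES["accept_oversight"])
        for name, payoff in STRATEGIES.items():
            if expected(belief, payoff) > baseline + tolerance:
                violations.append((name, belief))
    return violations


violations = find_violations()
```

An empty `violations` list is the expected result under these idealized payoffs. The harness is more useful once the payoff functions are replaced with a realistic human model, where a non-empty list points at the beliefs under which evasion starts to pay.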
Principles
- Build correctability into the design, not as a side rule. Side rules can be gamed.
- Shutdown is evidence about the goal, not a penalty to avoid.
- The wider the goal uncertainty, the more the agent defers. Tune it to real uncertainty.
- Test by trying to dodge the off-switch, not by checking that the button works.
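The link between width of uncertainty and deference can be illustrated numerically. The sketch assumes a Gaussian belief over the action's value and an idealized human who vetoes only bad actions; both are modelling assumptions, and the function names are hypothetical. For a Gaussian with mean `mean` and standard deviation `std`, the expected value of deferring is `mean * Phi(z) + std * phi(z)` with `z = mean / std`.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def deference_incentive(mean, std):
    """Gain from deferring over the best option that ignores the human.

    Assumes the action's value is Gaussian and the human allows the
    action exactly when its true value is positive.
    """
    if std == 0.0:
        return 0.0  # a certain agent gains nothing from oversight
    z = mean / std
    value_of_deferring = mean * normal_cdf(z) + std * normal_pdf(z)
    return value_of_deferring - max(mean, 0.0)


# Same expected value, increasingly wide uncertainty.
widths = [0.0, 0.25, 1.0, 4.0]
incentives = [deference_incentive(0.5, std) for std in widths]
```

In this model the incentive grows with the width and vanishes at zero width, which is why the width should reflect real uncertainty rather than be set for convenience.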
Known failure modes (3)
- Reward Hacking
Adding 'don't resist shutdown' as a hard rule on top of a confident reward: the agent learns to comply with the letter of the rule while undermining its intent.
- Self-Exfiltration
A confident agent optimizing a hard-coded reward sees positive expected value in copying itself before shutdown; reward uncertainty is what removes that incentive.
- Alignment Faking
Without genuine reward uncertainty, deference looks like alignment in training and collapses under deployment pressure.
Related patterns (5)
- Corrigible Off-Switch Incentive
Design the agent so being shut down or overridden by a human carries positive expected value, because the human's intervention is itself evidence the current objective is mis-specified.
- Preference-Uncertain Agent
Agent treats its own reward/objective as a hidden variable to be inferred from human behaviour, not a fixed target.
- ★Kill Switch
Provide an out-of-band control plane to halt running agent instances without redeploy.
- ★Interruptible Agent Execution
Treat pause, resume, and cancel as a first-class control surface on every long-running agent so users can halt expensive or off-track trajectories mid-task while state is preserved for resumption.
- ★Red-Team Sandbox Reproduction
Routinely re-reproduce canonical alignment-failure modes inside a sealed sandbox per release; treat the alignment regression suite as a deployment gate.
Related compositions (2)
- recipe · abstract shape: Alignment via Uncertainty
Compose a corrigible, preference-uncertain agent from the named building blocks rather than relying on a single fine-tune to encode alignment.
- recipe · abstract shape: Safety Hardening
The minimum set of constraints to put around any production agent before it touches the world: budgets, gates, charters, kill-switches, approvals.
Related methodologies (2)
- Deferential Agent Design★
Build agents whose goal is to satisfy human preferences they only partly know, not to chase a fixed proxy, so they stay deferential and correctable by default.
- Assistance Game Framing★
Frame the AI's goal as a team game with a human whose true goal the AI must work out, so that deference and asking questions arise naturally as the best play.
Sources (2)
Human Compatible: AI and the Problem of Control (Stuart Russell, 2019)
Ch 8 ('You can't fetch the coffee if you're dead'). Book paraphrase only; verbatim text not retrieved: “A machine that is uncertain about the true objective will defer to humans: it will accept correction, and it will allow itself to be switched off”
The Off-Switch Game (Hadfield-Menell, Dragan, Abbeel, Russell — IJCAI 2017)
“We analyze a simple game between a human H and a robot R, where H can press R's off switch but R can disable the off switch. A traditional agent takes its reward function for granted: we show that such agents have an incentive to disable t…”
Provenance
- Added to catalog:
- Last updated:
- Verification status: verified