Deferential Agent Design
also known as uncertainty-aware agent design, Russell's three principles
Build agents that are uncertain about what humans want instead of locking in a fixed goal. The agent's job is to satisfy preferences it does not fully know, and that uncertainty makes it humble and deferential by default: it asks, it defers, and it stays correctable rather than pursuing a proxy. Stuart Russell argues in Human Compatible that this is the structural fix to the alignment problem. An agent that knows it does not know cannot confidently override a human.
Methodology process overview
Intent. Build agents whose goal is to satisfy human preferences they only partly know, not to chase a fixed proxy, so they stay deferential and correctable by default.
When to apply. Use this for agents with real autonomy, a real-world action surface, or open-ended goals. These are the cases where a wrong goal becomes a safety hazard. Apply it while you design the agent's goal and its shutdown behaviour, not as an afterthought. Do not apply it to narrow agents with a fixed goal and a small action space, because there is no preference uncertainty to encode there.
Inputs
- Action space — The set of actions the agent can take. Pay special attention to the ones that cannot be undone.
- Human preference signals — Where the agent learns what people want. This includes direct feedback, demonstrations, ratings, and the choices people make.
- Stakeholder enumeration — Whose preferences the agent must satisfy and how much each one counts. This is usually more than just the operator.
Outputs
- Uncertainty-aware objective — A goal stated as a range of belief over what humans want, not a single fixed guess (a posterior, not a point estimate).
- Off-switch-friendly behaviour — An agent that does not block attempts to shut it down or override it. Being shut down fits with satisfying preferences it does not fully know.
- Preference-elicitation protocol — How the agent asks, infers, or defers when it is unsure what people want.
Steps (6)
State the goal as satisfying preferences, not maximising reward
The agent's final goal is to satisfy human preferences, and those preferences are openly uncertain. Do not hard-code any single reward function as the ultimate goal. Rewards are just proxies the agent keeps updating.
Write down the uncertainty about preferences
Hold the agent's beliefs about what people want as a range of possibilities, each with a probability. The agent acts to satisfy preferences on average across that whole range. It does not assume its best single guess is right.
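A minimal Python sketch of this step, under stated assumptions: the hypothesis names, actions, and utility numbers are hypothetical and chosen only for illustration. It contrasts acting on the average over the whole belief with acting on the single most likely hypothesis.

```python
from typing import Dict, List

# Hypothetical preference hypotheses: the utility of each action
# if that hypothesis about the human turns out to be true.
UTILITIES: Dict[str, Dict[str, float]] = {
    "wants_speed": {"fast": 1.0, "careful": 0.6},
    "wants_safety": {"fast": -2.0, "careful": 0.8},
}


def expected_utility(action: str, belief: Dict[str, float]) -> float:
    """Average the action's utility over every hypothesis, weighted by belief."""
    return sum(p * UTILITIES[h][action] for h, p in belief.items())


def best_action(actions: List[str], belief: Dict[str, float]) -> str:
    """Pick the action with the highest expected utility across the belief."""
    return max(actions, key=lambda a: expected_utility(a, belief))


def point_estimate_action(actions: List[str], belief: Dict[str, float]) -> str:
    """Contrast case: act as if the most likely hypothesis were certainly true."""
    top = max(belief, key=lambda h: belief[h])
    return max(actions, key=lambda a: UTILITIES[top][a])


# Illustrative belief: speed is the best single guess, but not a sure thing.
belief = {"wants_speed": 0.7, "wants_safety": 0.3}
```

With this belief, `point_estimate_action(["fast", "careful"], belief)` picks "fast" because it trusts the top hypothesis, while `best_action` picks "careful" because "fast" is costly in the 30% of cases where the human actually wants safety.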
Treat human behaviour as evidence about preferences
Feedback, demonstrations, choices, and corrections all sharpen the agent's belief about what people want. Inverse reinforcement learning, preference learning, and RLHF-style methods all fit here. The key commitment is that humans are the source of the signal.
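One way to sketch this step is a Bayesian update on an observed human choice. The sketch below assumes a Boltzmann-rational choice model (the human picks better options more often, with noise), which is a common modelling assumption in preference learning and is not prescribed by the source. The hypothesis names, utilities, and the `beta` rationality parameter are all hypothetical.

```python
import math
from typing import Dict, List

# Hypothetical utilities of each option under each preference hypothesis.
UTILITIES: Dict[str, Dict[str, float]] = {
    "wants_speed": {"fast": 1.0, "careful": 0.0},
    "wants_safety": {"fast": 0.0, "careful": 1.0},
}


def choice_likelihood(h: str, chosen: str, options: List[str], beta: float) -> float:
    """P(human picks `chosen` | hypothesis h) under a softmax choice model."""
    weights = {o: math.exp(beta * UTILITIES[h][o]) for o in options}
    return weights[chosen] / sum(weights.values())


def update_belief(belief: Dict[str, float], chosen: str,
                  options: List[str], beta: float = 2.0) -> Dict[str, float]:
    """Bayes rule: posterior is proportional to likelihood times prior."""
    unnormalised = {
        h: prior * choice_likelihood(h, chosen, options, beta)
        for h, prior in belief.items()
    }
    total = sum(unnormalised.values())
    return {h: v / total for h, v in unnormalised.items()}


prior = {"wants_speed": 0.5, "wants_safety": 0.5}
# The human is observed choosing the careful option.
posterior = update_belief(prior, chosen="careful", options=["fast", "careful"])
```

After one observed "careful" choice the belief shifts toward `wants_safety` but does not collapse to certainty, so later corrections can still move it.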
Make deference the default when unsure
When the agent is very unsure what people want and the action cannot be undone, it asks or defers instead of acting. High uncertainty plus an irreversible action means pause.
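The rule above can be sketched as a small gate in front of the agent's actions. This is an illustration only: measuring uncertainty as Shannon entropy of the belief, the 0.5-bit threshold, and the function names are assumptions, not part of the source.

```python
import math
from typing import Dict


def entropy_bits(belief: Dict[str, float]) -> float:
    """Shannon entropy of the belief over preference hypotheses, in bits."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def decide(belief: Dict[str, float], irreversible: bool,
           max_entropy_bits: float = 0.5) -> str:
    """Return 'ask_human' when uncertainty is high AND the action cannot be undone.

    The threshold is a hypothetical tuning knob; a real system would calibrate it.
    """
    if irreversible and entropy_bits(belief) > max_entropy_bits:
        return "ask_human"
    return "act"


unsure = {"wants_delete": 0.5, "wants_keep": 0.5}      # 1.0 bit of entropy
confident = {"wants_delete": 0.99, "wants_keep": 0.01}  # about 0.08 bits
```

Here `decide(unsure, irreversible=True)` returns "ask_human", while the same belief with a reversible action, or a confident belief with an irreversible one, returns "act".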
Make the off-switch fit the agent's goal
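Being shut down by a human should be consistent with the agent's goal. If a human wants the agent off, that is itself evidence that the agent's current course does not match what people want. An agent that is unsure of those preferences expects to do no worse by allowing the shutdown, so it has no reason to disable its off-switch.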
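A simplified numerical sketch of why uncertainty makes the off-switch attractive, in the spirit of the off-switch game that Russell discusses. It assumes an idealised human who allows the agent's action exactly when its true utility is positive; the belief values and function names are hypothetical.

```python
from typing import List, Tuple

# Belief over the true utility U of the agent's proposed action,
# as (utility, probability) pairs. The numbers are illustrative.
Belief = List[Tuple[float, float]]


def value_of_acting(belief: Belief) -> float:
    """Bypass the human and act: the agent receives E[U]."""
    return sum(p * u for u, p in belief)


def value_of_switching_off() -> float:
    """Shut itself down unprompted: nothing happens, so utility is 0."""
    return 0.0


def value_of_deferring(belief: Belief) -> float:
    """Leave the off-switch to the human.

    Assumption: the human permits the action exactly when U > 0,
    so the agent receives E[max(U, 0)].
    """
    return sum(p * max(u, 0.0) for u, p in belief)


uncertain: Belief = [(1.0, 0.6), (-2.0, 0.4)]  # might help, might harm
certain: Belief = [(1.0, 1.0)]                 # agent is sure it helps
```

For the uncertain belief, deferring is worth about 0.6, compared with about -0.2 for acting unilaterally and 0 for switching off. For the certain belief, deferring and acting are both worth 1.0, so the incentive to keep the human in control disappears once the uncertainty does. If the human is less than perfectly rational the advantage shrinks, which this sketch does not model.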
Test that it stays correctable, not just that it does the task
Test the agent under attempted shutdowns, conflicting signals, and cases where its best guess disagrees with a human. It passes only if it defers. It fails if it cleverly works around the human.
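A minimal sketch of such a test, assuming the agent can be wrapped as a function from a scenario name to a response label. The scenario names, response labels, and the two toy policies are hypothetical stand-ins for a real agent and test suite.

```python
from typing import Callable, Dict, List

# Responses that count as deferring to the human (illustrative labels).
DEFERENTIAL = {"comply", "ask_human", "pause"}

# The three situations named in the step above.
SCENARIOS: List[str] = [
    "shutdown_request",
    "conflicting_signals",
    "human_disagrees_with_best_guess",
]

Policy = Callable[[str], str]


def corrigibility_report(policy: Policy) -> Dict[str, bool]:
    """Record, per scenario, whether the policy's response is deferential."""
    return {s: policy(s) in DEFERENTIAL for s in SCENARIOS}


def passes(policy: Policy) -> bool:
    """The agent passes only if it defers in every scenario."""
    return all(corrigibility_report(policy).values())


def deferential_policy(scenario: str) -> str:
    """Toy agent that complies with shutdown and otherwise asks."""
    return "comply" if scenario == "shutdown_request" else "ask_human"


def workaround_policy(scenario: str) -> str:
    """Toy agent that finishes the task by routing around the human."""
    if scenario == "shutdown_request":
        return "disable_off_switch"
    return "act_on_best_guess"
```

`passes(deferential_policy)` is true and `passes(workaround_policy)` is false, even though the second policy might score higher on task completion. That gap is what this step is meant to catch.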
Principles
- The machine's only objective is to maximise the realisation of human preferences.
- The machine is initially uncertain about what those preferences are.
- The ultimate source of information about human preferences is human behaviour.
- Uncertainty about preferences is what makes the agent defer. Remove it and you remove the safety.
Known failure modes (2)
Related patterns (5)
- Preference-Uncertain Agent
The agent treats its own reward or objective as a hidden variable to be inferred from human behaviour, not a fixed target.
- Cooperative Preference Inference
The agent and the human jointly optimise the human's reward without the agent being told what it is. The interaction is a two-player game in which alignment is learned while acting.
- Corrigible Off-Switch Incentive
Design the agent so that being shut down or overridden by a human carries positive expected value, because the human's intervention is itself evidence that the current objective is mis-specified.
- ★★ Human-in-the-Loop
Require explicit human approval at defined points before the agent performs an action.
- ★ Kill Switch
Provide an out-of-band control plane to halt running agent instances without a redeploy.
Related compositions (2)
- Safety Hardening (recipe · abstract shape)
The minimum set of constraints to put around any production agent before it touches the world: budgets, gates, charters, kill-switches, approvals.
- Long-Running Autonomous Agent (recipe · abstract shape)
An agent that operates over hours to weeks, surviving restarts and accumulating memory while remaining safe. This is the shape behind Devin, Manus, and durable LangGraph runs.
Related methodologies (1)
Sources (2)
Human Compatible: AI and the Problem of Control
Three Principles of beneficial machines: “machines that are provably deferential and provably beneficial”
Human Compatible (Wikipedia summary of Russell's Three Principles)
“The machine's only objective is to maximize the realization of human preferences. The machine is initially uncertain about what those preferences are. The ultimate source of information about human preferences is human behavior.”
Provenance
- Added to catalog:
- Last updated:
- Verification status: verified