Methodology · Safety & Alignmentemergingverified

Deferential Agent Design

also known as uncertainty-aware agent design, Russell three principles

Applies to: agentmulti-agent-systemautonomous-agent

Tags: alignmentcorrigibilitypreference-uncertaintythree-principles

Build agents that are unsure about what humans want, instead of locking in a fixed goal. The agent's job is to satisfy preferences it does not fully know. That uncertainty makes it humble and deferential by default. It asks, it defers, and it stays correctable instead of chasing a fixed proxy goal. Russell argues this is the structural fix to the alignment problem. An agent that knows it does not know cannot confidently override a human.

Methodology process overview

sequenceDiagram participant H as Human participant A as Agent Note over A: Posterior over preferences (wide prior) H->>A: Task: 'fetch coffee' A->>A: Score actions under preference posterior alt Action is reversible & low uncertainty A->>H: Execute action H-->>A: (implicit feedback) else Action is irreversible OR uncertainty high A->>H: Ask: 'kitchen is locked — break door?' H-->>A: 'No, skip it' A->>A: Update posterior (refusal = evidence) end H->>A: Press off-switch A->>A: Shutdown is consistent with unknown preferences A-->>H: Accept shutdown (no resistance) Note over A: Corrigibility falls out of uncertainty, not a rule

Intent. Build agents whose goal is to satisfy human preferences they only partly know, not to chase a fixed proxy, so they stay deferential and correctable by default.

When to apply. Use this for agents with real autonomy, a real-world action surface, or open-ended goals. These are the cases where a wrong goal becomes a safety hazard. Apply it when you design the agent's goal and its shutdown behaviour, not as an afterthought. Don't apply it for narrow agents with a fixed goal and a small action space. There is no preference uncertainty to encode there.

Example scenario

A small startup is building a home robot that can run errands inside a household. The first design hard-codes a reward of '+1 per completed errand'. The team quickly hits test failures. In simulation the robot pushes past a child blocking a doorway because that path is shorter. It also resists a stop command from the parent, because shutting down lowers its reward. The team rebuilds the goal along Russell's three principles. Instead of a fixed reward, they give the robot a range of belief over what the household might actually want. They treat every correction as evidence that updates that belief, such as the parent pulling the robot back or the child crying. They then check that, across the full range of belief, accepting the parent's stop command is worth more than continuing. Next the team runs a battery of correctability tests. These include conflicting commands from two adults, attempts to shut the robot down mid-task, and prompts for irreversible actions such as 'open the medicine cabinet'. The robot now pauses and asks when it is very unsure. It accepts shutdown without negotiating. It acts on its own only for low-stakes, reversible tasks, such as fetching a sock from the laundry room. The team ships only the deferential version. They keep the original reward-maximising baseline as a regression check for the exact behaviour this methodology is meant to prevent.

Inputs

Action space — The set of actions the agent can take. Pay special attention to the ones that cannot be undone.
Human preference signals — Where the agent learns what people want. This includes direct feedback, demonstrations, ratings, and the choices people make.
Stakeholder enumeration — Whose preferences the agent must satisfy and how much each one counts. This is usually more than just the operator.

Outputs

Uncertainty-aware objective — A goal stated as a range of belief over what humans want, not a single fixed guess (a posterior, not a point estimate).
Off-switch-friendly behaviour — An agent that does not block attempts to shut it down or override it. Being shut down fits with satisfying preferences it does not fully know.
Preference-elicitation protocol — How the agent asks, infers, or defers when it is unsure what people want.

Steps (6)

State the goal as satisfying preferences, not maximising reward
The agent's final goal is to satisfy human preferences, and those preferences are openly uncertain. Do not hard-code any single reward function as the ultimate goal. Rewards are just proxies the agent keeps updating.
Write down the uncertainty about preferences
Hold the agent's beliefs about what people want as a range of possibilities, each with a probability. The agent acts to satisfy preferences on average across that whole range. It does not assume its best single guess is right.
Treat human behaviour as evidence about preferences
Feedback, demonstrations, choices, and corrections all sharpen the agent's belief about what people want. Inverse reinforcement learning, preference learning, and RLHF-style methods all fit here. The key commitment is that humans are the source of the signal.
Make deference the default when unsure
When the agent is very unsure what people want and the action cannot be undone, it asks or defers instead of acting. High uncertainty plus an irreversible action means pause.
Make the off-switch fit the agent's goal
Being shut down by a human should fit the agent's goal. If a human wants the agent off, that itself is evidence about what people want. So the agent has no reason to disable its off-switch.
Test that it stays correctable, not just that it does the task
Test the agent under attempted shutdowns, conflicting signals, and cases where its best guess disagrees with a human. It passes only if it defers. It fails if it cleverly works around the human.

Framework-specific instructions

Pick a framework and generate a framework-targeted rewrite of this methodology's steps.

Choose framework

AI-generated for Agent Development Kit (ADK) (Google) — verify against official docs.

Principles

The machine's only objective is to maximise the realisation of human preferences.
The machine is initially uncertain about what those preferences are.
The ultimate source of information about human preferences is human behaviour.
Uncertainty about preferences is what makes the agent defer. Remove it and you remove the safety.

Known failure modes (2)

Related patterns (5)

Related compositions (2)

Related methodologies (1)

Crawl-Walk-Run Automation Gating★★
6 steps
Separate what an agent can do from what it is allowed to do on its own. A system that could plausibly act gets to act only after the data earns it, one action type at a time.

Sources (2)

Provenance

Added to catalog: 2026-05-24
Last updated: 2026-05-27
Verification status: verified

Methodology process overview

Steps (6)

State the goal as satisfying preferences, not maximising reward

Write down the uncertainty about preferences

Treat human behaviour as evidence about preferences

Make deference the default when unsure

Make the off-switch fit the agent's goal

Test that it stays correctable, not just that it does the task