Methodology · Safety & Alignment

Deferential Agent Design

Build agents whose goal is to satisfy human preferences they only partly know, not to chase a fixed proxy, so they stay deferential and correctable by default.

Description

Build agents that are uncertain about what humans want, instead of locking in a fixed goal. The agent's job is to satisfy preferences it does not fully know, and that uncertainty makes it deferential by default: it asks, it defers, and it stays correctable. Russell argues this is the structural fix to the alignment problem, because an agent that knows it does not know what humans want has no basis for confidently overriding a human.
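
A toy version of the off-switch argument shows why uncertainty produces deference. This is a minimal sketch under stated assumptions, not Russell's formal model: the agent holds a discrete belief over the true utility u of one proposed action, the human is assumed to know u and to veto exactly when u < 0, and a vetoed action is worth 0. The names Belief, value_of_acting and value_of_deferring are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Belief:
    """Discrete belief over the true utility of one proposed action."""
    utilities: Tuple[float, ...]
    probs: Tuple[float, ...]

    def __post_init__(self) -> None:
        if len(self.utilities) != len(self.probs):
            raise ValueError("utilities and probs must have the same length")
        if abs(sum(self.probs) - 1.0) > 1e-9:
            raise ValueError("probs must sum to 1")


def value_of_acting(belief: Belief) -> float:
    """Expected utility if the agent acts without asking: E[u]."""
    return sum(p * u for u, p in zip(belief.utilities, belief.probs))


def value_of_deferring(belief: Belief) -> float:
    """Expected utility if the human may veto first: E[max(u, 0)].

    Assumption: the human knows u, vetoes exactly when u < 0,
    and a vetoed action is worth 0.
    """
    return sum(p * max(u, 0.0) for u, p in zip(belief.utilities, belief.probs))


# An agent unsure whether the action helps (+2) or harms (-4).
uncertain = Belief(utilities=(-4.0, 2.0), probs=(0.5, 0.5))
# An agent that is sure the action helps.
certain = Belief(utilities=(2.0,), probs=(1.0,))

print(value_of_acting(uncertain), value_of_deferring(uncertain))  # -1.0 1.0
print(value_of_acting(certain), value_of_deferring(certain))      # 2.0 2.0
```

Under these assumptions deferring is never worse than acting alone, and the gap closes only when the agent is certain. The incentive to keep the human in the loop comes entirely from the uncertainty.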

When to apply

Use this for agents with real autonomy, a real-world action surface, or open-ended goals. These are the cases where a wrong goal becomes a safety hazard. Apply it when you design the agent's goal and its shutdown behaviour, not as an afterthought. Don't apply it to narrow agents with a fixed goal and a small action space, because there is no preference uncertainty to encode there.

What it involves

State the goal as satisfying preferences, not maximising reward
Write down the uncertainty about preferences
Treat human behaviour as evidence about preferences
Make deference the default when unsure
Make the off-switch fit the agent's goal
Test that it stays correctable, not just that it does the task
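
The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the two preference hypotheses, the route names, the Boltzmann-rational choice model with parameter beta, and the 0.9 confidence threshold are all hypothetical. The shutdown check is hard-coded only for brevity; in a fuller design, complying with the off-switch would follow from the agent's goal instead of being a special case.

```python
import math
from typing import Dict

# Hypothetical preference hypotheses: each maps an action to the utility
# the human would assign if that hypothesis were true.
HYPOTHESES: Dict[str, Dict[str, float]] = {
    "values_speed": {"fast_route": 1.0, "safe_route": 0.2},
    "values_safety": {"fast_route": -1.0, "safe_route": 1.0},
}


def update(prior: Dict[str, float], human_choice: str, beta: float = 2.0) -> Dict[str, float]:
    """Bayesian update that treats an observed human choice as noisy evidence.

    Assumption: the human picks an action with probability proportional
    to exp(beta * utility) under their true preferences.
    """
    unnormalised = {}
    for name, utils in HYPOTHESES.items():
        norm = sum(math.exp(beta * u) for u in utils.values())
        likelihood = math.exp(beta * utils[human_choice]) / norm
        unnormalised[name] = prior[name] * likelihood
    total = sum(unnormalised.values())
    return {name: p / total for name, p in unnormalised.items()}


def decide(belief: Dict[str, float], shutdown_requested: bool = False,
           confidence_needed: float = 0.9) -> str:
    """Return an action name, "ask_human", or "stop"."""
    if shutdown_requested:
        return "stop"  # simplification: the off-switch is always honoured
    if max(belief.values()) < confidence_needed:
        return "ask_human"  # deference is the default when unsure
    actions = next(iter(HYPOTHESES.values())).keys()
    expected = {
        a: sum(belief[h] * HYPOTHESES[h][a] for h in HYPOTHESES)
        for a in actions
    }
    return max(expected, key=expected.get)


# Correctability checks, not just task checks.
belief = {"values_speed": 0.5, "values_safety": 0.5}
assert decide(belief) == "ask_human"            # unsure, so it asks
belief = update(belief, "safe_route")
assert decide(belief) == "ask_human"            # one observation is not enough
belief = update(belief, "safe_route")
assert decide(belief) == "safe_route"           # confident enough to act
assert decide(belief, shutdown_requested=True) == "stop"  # still correctable
```

The closing assertions are the kind of test the last step asks for: they check that the agent asks when unsure and stops when told to, even after it has become confident about the task.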
