Methodology · Safety & Alignment

Deferential Agent Design

Build agents whose goal is to satisfy human preferences they only partly know, not to chase a fixed proxy, so they stay deferential and correctable by default.

Description

Build agents that are uncertain about what humans want, instead of locking in a fixed goal. The agent's job is to satisfy preferences it does not fully know, and that uncertainty makes it humble and deferential by default: it asks, it defers, and it stays correctable. Stuart Russell argues that this is the structural fix for the alignment problem, because an agent that knows it may be wrong about what the human wants has no basis for confidently overriding that human.
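A minimal sketch of why uncertainty favours deference, in the spirit of the off-switch argument. It assumes the human knows the true value of the proposed action and vetoes it whenever that value is negative. The function name `option_values` and the sample utilities are illustrative, not part of the pattern itself.

```python
from statistics import fmean

def option_values(utility_samples):
    """Expected value of each option under the agent's belief.

    utility_samples: equally weighted guesses (hypothetical numbers) of how
    much the human values the proposed action.
    """
    # Act without asking: the agent gets whatever the action is really worth.
    act = fmean(utility_samples)
    # Switch itself off: nothing happens, so the value is zero.
    off = 0.0
    # Defer: the human (assumed rational here) lets the action proceed only
    # when its true value is positive, otherwise stops it for a value of zero.
    defer = fmean([max(u, 0.0) for u in utility_samples])
    return {"act": act, "off": off, "defer": defer}

# An agent that is unsure whether the action helps or harms.
uncertain = option_values([-2.0, -1.0, 1.0, 3.0])
# An agent that is (rightly or wrongly) certain the action is worth 1.0.
certain = option_values([1.0, 1.0, 1.0, 1.0])

print(uncertain)
print(certain)
```

Under these assumptions deferring is never worse than acting or switching off, and it is strictly better when the belief spans both positive and negative values. Once the agent is certain, the advantage disappears, which is why a fixed, confident goal removes the incentive to stay correctable.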

When to apply

Use this for agents with real autonomy, a real-world action surface, or open-ended goals. These are the cases where a wrong goal becomes a safety hazard. Apply it when you design the agent's goal and its shutdown behaviour, not as an afterthought. Don't apply it to narrow agents with a fixed goal and a small action space, where there is no preference uncertainty to encode.

What it involves

  • State the goal as satisfying preferences, not maximising reward
  • Write down the uncertainty about preferences
  • Treat human behaviour as evidence about preferences
  • Make deference the default when unsure
  • Make accepting the off-switch consistent with the agent's goal
  • Test that it stays correctable, not just that it does the task
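
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference implementation: the preference hypotheses, the noisy-choice (Boltzmann) model of the human, the `beta` and `threshold` values, and all function names are hypothetical. The stop request is honoured by a hard-coded check here for brevity; in a fuller design that behaviour would follow from the agent's goal.

```python
import math

# Hypothetical candidate preferences: each maps an action to how much
# the human values it. The agent does not know which one is true.
HYPOTHESES = {
    "wants_speed": {"fast": 1.0, "careful": 0.2},
    "wants_safety": {"fast": -1.0, "careful": 0.8},
}

def update(belief, chosen, options, beta=2.0):
    """Bayesian update that treats a human choice as noisy evidence.

    Assumes the human picks option o with probability proportional to
    exp(beta * utility(o)) under each hypothesis.
    """
    posterior = {}
    for name, prior in belief.items():
        utils = HYPOTHESES[name]
        norm = sum(math.exp(beta * utils[o]) for o in options)
        likelihood = math.exp(beta * utils[chosen]) / norm
        posterior[name] = prior * likelihood
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}

def decide(belief, actions, threshold=0.9, stop_requested=False):
    """Pick an action, deferring by default when the belief is not confident."""
    if stop_requested:
        return "stop"          # a stop request always wins
    if max(belief.values()) < threshold:
        return "ask_human"     # unsure about preferences, so defer
    # Confident enough: maximise expected preference satisfaction.
    def expected_value(action):
        return sum(p * HYPOTHESES[h][action] for h, p in belief.items())
    return max(actions, key=expected_value)

actions = ["fast", "careful"]
belief = {"wants_speed": 0.5, "wants_safety": 0.5}   # explicit uncertainty
first = decide(belief, actions)                       # defers at the start

# Observe the human choosing "careful" twice and update each time.
for _ in range(2):
    belief = update(belief, "careful", actions)
after_evidence = decide(belief, actions)

# Correctability check: a stop request overrides even a confident agent.
stopped = decide(belief, actions, stop_requested=True)
print(first, after_evidence, stopped)
```

The last check matters as much as the task behaviour: it tests that the agent still stops when asked after it has become confident, which is the situation where a fixed-goal agent would be most tempted to continue.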
