Methodology · Safety & Alignment

Assistance Game Framing

Frame the AI's goal as a team game with a human whose true goal the AI must work out, so that deference and asking questions arise naturally as the best play.

Description

Treat the human and the AI as two players on the same team. They both want the same outcome, but only the human knows what that outcome is. The AI starts out unsure and has to work out the goal from what the human does. This setup, the assistance game, is the formal backbone of Russell's deferential-agent argument. When you play the game out, the AI ends up seeking information, asking permission, and accepting correction. It does so because those moves have a higher expected payoff than acting alone on a guess that might be wrong.
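To make the payoff comparison concrete, here is a minimal sketch of a single permission decision. It assumes a hypothetical three-point belief over how much the human values a proposed action, and a human who vetoes exactly when that value is negative. The numbers and names are illustrative only and do not come from any particular system.

```python
# Hypothetical belief: value of the proposed action to the human -> probability.
belief = {-2.0: 0.4, 1.0: 0.3, 3.0: 0.3}


def expected(f, belief):
    """Expected value of f(u) under the AI's belief over u."""
    return sum(p * f(u) for u, p in belief.items())


# Option 1: act alone on the guess and receive whatever the true value is.
act_alone = expected(lambda u: u, belief)

# Option 2: do nothing, which is worth zero by convention.
do_nothing = 0.0

# Option 3: ask permission. Assumption: the human knows u and vetoes when u < 0,
# so the team receives max(u, 0).
ask_permission = expected(lambda u: max(u, 0.0), belief)

values = {
    "act_alone": act_alone,
    "do_nothing": do_nothing,
    "ask_permission": ask_permission,
}
best_move = max(values, key=values.get)
print(values, best_move)
```

Under these assumptions asking permission is never worse than the other two options, because the human's veto removes only the outcomes with negative value. The advantage shrinks as the AI becomes more certain, or if the human's veto is itself unreliable.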

When to apply

Use this when you design the goal and interaction style of an autonomous agent that takes real-world actions. It fits best when a wrong action is costly and the human's true preferences are hard to state upfront. Don't apply it to narrow agents with a fixed, fully specified reward, because there is no preference uncertainty to model. Also skip it where modelling a two-player team game would cost more than it returns, for example when mistakes are cheap and easy to undo.

What it involves

  • Set up the team game
  • Give the AI a starting range of belief over the goal
  • Model how the human acts
  • Work out the AI's best play
  • Check that it stays correctable
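The five steps above can be sketched on a toy problem. This is one illustrative reading, not a reference implementation: the tea/coffee goals, the move names, the noisily rational (Boltzmann) human model with `beta=2.0`, and the `ask_cost` of 0.3 are all assumptions chosen for the example.

```python
import math

# Step 1: set up the team game. Both players share one reward that depends on
# a goal only the human knows. All names here are hypothetical.
GOALS = ["tea", "coffee"]
ACTION_GOAL = {"make_tea": "tea", "make_coffee": "coffee"}
MOVE_GOAL = {"reach_for_kettle": "tea", "reach_for_grinder": "coffee"}


def reward(goal, action):
    """Shared reward: +1 if the AI's action serves the goal, -1 otherwise."""
    return 1.0 if ACTION_GOAL[action] == goal else -1.0


# Step 2: give the AI a starting belief over the goal.
prior = {"tea": 0.5, "coffee": 0.5}


# Step 3: model how the human acts. Assumption: the human is noisily rational,
# picking goal-consistent moves more often, with sharpness set by beta.
def human_likelihood(move, goal, beta=2.0):
    weights = {
        m: math.exp(beta * (1.0 if g == goal else 0.0))
        for m, g in MOVE_GOAL.items()
    }
    return weights[move] / sum(weights.values())


def update(belief, move):
    """Bayesian update of the belief after observing one human move."""
    unnorm = {g: p * human_likelihood(move, g) for g, p in belief.items()}
    total = sum(unnorm.values())
    return {g: w / total for g, w in unnorm.items()}


# Step 4: work out the AI's best play. Asking reveals the goal at a fixed cost.
def best_play(belief, ask_cost=0.3):
    values = {
        a: sum(p * reward(g, a) for g, p in belief.items())
        for a in ACTION_GOAL
    }
    values["ask"] = (
        sum(p * max(reward(g, a) for a in ACTION_GOAL) for g, p in belief.items())
        - ask_cost
    )
    return max(values, key=values.get)


# Step 5: check that it stays correctable. Contrary evidence should first make
# the AI hesitate and then make it switch.
start = best_play(prior)
after_kettle = update(prior, "reach_for_kettle")
after_one_correction = update(after_kettle, "reach_for_grinder")
after_two_corrections = update(after_one_correction, "reach_for_grinder")
trace = [
    start,
    best_play(after_kettle),
    best_play(after_one_correction),
    best_play(after_two_corrections),
]
print(trace)
```

In this sketch the AI asks while it is unsure, acts once the evidence is strong enough to beat the cost of asking, and goes back to asking or switches action when the human's later moves contradict its belief. A prior that puts zero probability on the true goal would break the last step, since no amount of evidence could move the belief, so the correctability check is worth running against the actual prior rather than assumed.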

Related