Methodology · Safety & Alignment · emerging · verified

Assistance Game Framing

also known as cooperative inverse RL framing, CIRL framing

Applies to: agent, autonomous-agent, multi-agent-system

Tags: cirl, cooperative-game, alignment, uncertainty

Treat the human and the AI as two players on the same team. They both want the same outcome, but only the human knows what that outcome is. The AI starts out unsure and has to work out the goal from what the human does. This is the formal backbone of Russell's deferential-agent argument. When you play this game out, the AI ends up seeking information, asking permission, and accepting correction. It does so because those moves pay off more than acting alone on a guess that might be wrong.

Methodology process overview

sequenceDiagram
    participant H as Human (knows reward R*)
    participant A as AI agent (prior P(R))
    Note over H,A: Shared reward R*, asymmetric information
    H->>A: Demonstrates action a_H
    A->>A: Bayes update: P(R | a_H) under Boltzmann-rational model
    A->>A: Compute argmax_a E[R | posterior]
    alt Expected value across posterior is clear
        A->>H: Execute action a_A
    else Two actions tie under posterior, differ under hypotheses
        A->>H: Query: 'Did you mean X or Y?'
        H-->>A: Clarifies / corrects
        A->>A: Posterior narrows
    end
    H->>A: Override / correction
    A->>A: Override = informative about R*
    A-->>H: Accept correction (raises expected R)
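The Bayes update and action choice named in the diagram can be written out as below. This is one common way to state them, not a formula taken from this page: the rationality parameter \beta, the action-value function Q_R, and the human policy \pi_H are assumed symbols introduced here for illustration.

```latex
\pi_H(a \mid R) = \frac{\exp\big(\beta\, Q_R(a)\big)}{\sum_{a'} \exp\big(\beta\, Q_R(a')\big)},
\qquad
P(R \mid a_H) = \frac{P(R)\,\pi_H(a_H \mid R)}{\sum_{R'} P(R')\,\pi_H(a_H \mid R')},
\qquad
a_A = \arg\max_{a}\; \mathbb{E}_{R \sim P(\cdot \mid a_H)}\big[R(a)\big]
```

A larger \beta models a human who picks the best action more reliably, so each demonstration moves the posterior further.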

Intent. Frame the AI's goal as a team game with a human whose true goal the AI must work out, so that deference and asking questions arise naturally as the best play.

When to apply. Use this when you design the goal and interaction style of an autonomous agent that takes real-world actions. It fits best when a wrong action is costly and the human's true preferences are hard to state upfront. Don't apply it to narrow agents with a fixed, fully specified reward, because there is no preference uncertainty to model. Also skip it where modelling a two-player team game costs more than it returns, for example when actions are cheap and easy to undo.

Example scenario

A robotics lab studies a home-help scenario in simulation. A robot must help a human stock a fridge. But the human's true preferences are not known upfront: how to place food, how to handle expiry, and a rule to not touch the back of the top shelf. The researchers set this up as a cooperative inverse RL game, following Hadfield-Menell et al. The robot starts with a range of belief over plausible goals, such as organise by category, by expiry, or by reach height. It also has a Boltzmann-rational model of the human. During the run, the human deliberately places items on the lower shelves. The robot's belief shifts toward a 'reach-height' goal. The robot is then asked to put away a heavy jar. Two placements look almost equally good on average, but they differ a lot under specific goals. So the best play is to ask rather than guess. Later the human reaches in to move an item. The robot treats this as evidence and updates, instead of resisting. The researchers report that this policy produces deferential, question-asking behaviour with no explicit 'always ask' rule. The behaviour falls out of working through the game. The lab is honest about scope. The formalism is research-grade, current versions only approximate the best play using deep RL, and the result has not yet been moved to a deployed home robot.
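The belief shift in this scenario can be sketched as a small Bayes update. This is a toy illustration, not the lab's code: the goal names, shelf values, and the rationality parameter `beta` are all made-up assumptions chosen so the arithmetic is easy to follow.

```python
import math

# Hypothetical goals and shelves for the fridge example; all numbers are illustrative.
GOALS = ["by_category", "by_expiry", "by_reach_height"]
SHELVES = ["top", "middle", "bottom"]

# Assumed value of placing an everyday item on each shelf under each goal.
VALUE = {
    "by_category": {"top": 1.0, "middle": 1.0, "bottom": 1.0},
    "by_expiry": {"top": 1.0, "middle": 1.2, "bottom": 0.8},
    "by_reach_height": {"top": 0.0, "middle": 1.0, "bottom": 2.0},
}


def boltzmann_likelihood(action, goal, beta=2.0):
    """P(action | goal): a softmax over that goal's action values."""
    weights = {a: math.exp(beta * VALUE[goal][a]) for a in SHELVES}
    return weights[action] / sum(weights.values())


def update(belief, action, beta=2.0):
    """One Bayes step: posterior is proportional to prior times likelihood."""
    unnorm = {g: p * boltzmann_likelihood(action, g, beta) for g, p in belief.items()}
    total = sum(unnorm.values())
    return {g: w / total for g, w in unnorm.items()}


belief = {g: 1 / len(GOALS) for g in GOALS}  # uniform prior over the three goals
for observed in ["bottom", "bottom", "middle"]:  # human favours lower shelves
    belief = update(belief, observed)

most_likely_goal = max(belief, key=belief.get)
print(most_likely_goal)  # by_reach_height
```

A human correction, such as moving an item the robot placed, can be fed through the same `update` function as one more observed action, which is what "treats this as evidence" amounts to in this sketch.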

Inputs

Action space — Everything the human and the AI can do together, including talking and asking.
Prior over reward functions — The AI's starting range of belief over which goals the human might hold, each with a probability (a prior).
Human-behaviour model — An assumption about how the human acts given the goal. This usually treats the human as roughly, but not perfectly, rational.

Outputs

Cooperative game specification — A clear definition of the game: the players, their actions, the shared goal, and the fact that only the human knows it.
AI policy — A plan of action that does best on average across the AI's uncertainty about the true goal.
Permission and deference behaviour — Asking, seeking information, and deferring. These arise from the best play of the game, not from hand-coded rules.
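One way to keep the inputs and the game specification together is a small container type. The sketch below is hypothetical: the class name `AssistanceGame`, its fields, and the example actions are assumptions for illustration, and no particular framework defines them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AssistanceGame:
    """Hypothetical container for the inputs listed above; names are illustrative."""

    human_actions: List[str]
    ai_actions: List[str]                # includes communicative moves such as "ask"
    prior: Dict[str, float]              # goal name -> probability
    reward: Callable[[str, str], float]  # shared reward(goal, ai_action)
    beta: float = 1.0                    # rationality of the Boltzmann human model

    def __post_init__(self):
        # A prior must be a proper probability distribution.
        if abs(sum(self.prior.values()) - 1.0) > 1e-9:
            raise ValueError("prior must sum to 1")

    def expected_reward(self, ai_action: str) -> float:
        """Average shared reward of one AI action across the AI's uncertainty."""
        return sum(p * self.reward(goal, ai_action) for goal, p in self.prior.items())


# Illustrative instance: the placement scores 1 only when it matches the goal.
game = AssistanceGame(
    human_actions=["place_low", "place_high"],
    ai_actions=["low_shelf", "door", "ask"],
    prior={"reach_height": 0.5, "by_category": 0.5},
    reward=lambda goal, action: 1.0 if (goal == "reach_height") == (action == "low_shelf") else 0.0,
)
```

Note that there is a single `reward` function rather than one per player, which is how the sketch encodes "the shared goal".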

Steps (5)

1. Set up the team game
Define two players, the human and the AI. They share one goal, but only the human knows it. Lay out everything they can both do. The AI scores on that same shared goal. Its only aim is the human's goal, not a separate one of its own.
2. Give the AI a starting range of belief over the goal
Capture the AI's initial uncertainty about the human's goal as a set of possibilities, each with a probability (a prior). A wide range makes the AI defer more. A narrow range lets it act more on its own.
3. Model how the human acts
Assume the human acts roughly in line with the true goal, with some noise (a Boltzmann-rational model). This lets the AI read the human's actions as evidence about the goal.
4. Work out the AI's best play
Compute the AI's best play, or get close to it. Seeking information, asking, and accepting correction all show up as best moves. They raise the expected shared payoff when the AI is unsure.
Uses: Cooperative Preference Inference
5. Check that it stays correctable
Confirm the resulting plan defers to a human override and asks when unsure. Confirm it does not act on a shaky best guess when the action cannot be undone.
Uses: Preference-Uncertain Agent, Human-in-the-Loop
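The last two steps can be illustrated with a one-shot decision rule that compares acting now against asking first. This is a minimal sketch under strong assumptions: the human's answer is taken to reveal the goal exactly, asking has a fixed cost `ask_cost`, and the reward table and function names are hypothetical.

```python
def expected_reward(posterior, reward, action):
    """Average reward of one action across the AI's belief over goals."""
    return sum(p * reward[goal][action] for goal, p in posterior.items())


def decide(posterior, reward, actions, ask_cost=0.1):
    """Return 'ask' when a clarifying answer is worth more than its cost."""
    best_action = max(actions, key=lambda a: expected_reward(posterior, reward, a))
    act_value = expected_reward(posterior, reward, best_action)
    # Value if the human's answer revealed the goal before the AI acted.
    informed_value = sum(
        p * max(reward[goal][a] for a in actions) for goal, p in posterior.items()
    )
    if informed_value - ask_cost > act_value:
        return "ask"
    return best_action


# Illustrative jar placement: each spot is good under one goal and bad under the other.
ACTIONS = ["low_shelf", "door"]
REWARD = {
    "reach_height": {"low_shelf": 1.0, "door": -1.0},
    "by_category": {"low_shelf": -1.0, "door": 1.0},
}

unsure = decide({"reach_height": 0.5, "by_category": 0.5}, REWARD, ACTIONS)
confident = decide({"reach_height": 0.99, "by_category": 0.01}, REWARD, ACTIONS)
print(unsure, confident)  # ask low_shelf
```

In this toy, no rule says "always ask": the query wins only while the two goals disagree about the action and the belief is spread across them. A correctability check for step 5 could rerun `decide` on irreversible actions with a wider prior and confirm it still returns `"ask"`.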

Principles

Same goal, unequal knowledge. The AI and the human are on the same side, but only the human knows the target.
Uncertainty about the goal is what makes the AI defer. Remove it and the deference collapses.
Human behaviour is evidence about the goal, not a separate rule to optimise around.
Let good behaviour come from the game itself, not from bolt-on rules. Correctability should fall out of the structure.
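The second principle can be checked with a toy calculation. The sketch assumes a perfectly rational human who allows a proposed action only when its true utility is positive, and an AI that can otherwise either act or do nothing (utility 0). The function name and the numbers are illustrative assumptions.

```python
def deference_gain(belief):
    """Extra expected utility from deferring to the human instead of deciding alone.

    belief: list of (probability, utility) pairs for the AI's proposed action.
    """
    # Deciding alone: take the action if it looks good on average, else do nothing.
    act_alone = max(sum(p * u for p, u in belief), 0.0)
    # Deferring: the human lets the action through only in the cases where it helps.
    defer = sum(p * max(u, 0.0) for p, u in belief)
    return defer - act_alone


uncertain = [(0.6, 1.0), (0.4, -2.0)]  # AI is unsure whether the action helps
certain = [(1.0, 1.0)]                 # AI knows the action helps

gain_uncertain = deference_gain(uncertain)
gain_certain = deference_gain(certain)
```

Under these assumptions the gain is never negative, and it drops to exactly zero once the belief has a single outcome. That is the sense in which removing the uncertainty removes the incentive to defer.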

Related methodologies (1)

Deferential Agent Design
6 steps
Build agents whose goal is to satisfy human preferences they only partly know, not to chase a fixed proxy, so they stay deferential and correctable by default.

Provenance

Added to catalog: 2026-05-24
Last updated: 2026-05-27
Verification status: verified
