Assistance Game Framing
also known as cooperative inverse RL framing, CIRL framing
Treat the human and the AI as two players on the same team. They both want the same outcome, but only the human knows what that outcome is. The AI starts out unsure and has to work out the goal from what the human does. This is the formal backbone of Russell's deferential-agent argument. When you play this game out, the AI ends up seeking information, asking permission, and accepting correction. It does so because those moves pay off more than acting alone on a guess that might be wrong.
Methodology process overview
Intent. Frame the AI's goal as a team game with a human whose true goal the AI must work out, so that deference and asking questions arise naturally as the best play.
When to apply. Use this when you design the goal and interaction style of an autonomous agent that takes real-world actions. It fits best when a wrong action is costly and the human's true preferences are hard to state upfront. Don't apply it to narrow agents with a fixed, fully specified reward, because there is no preference uncertainty to model. Also skip it where the stakes are too low to justify modelling the interaction as a two-player game.
Inputs
- Action space — Everything the human and the AI can do together, including talking and asking.
- Prior over reward functions — The AI's starting range of belief over which goals the human might hold, each with a probability (a prior).
- Human-behaviour model — An assumption about how the human acts given the goal. This usually treats the human as roughly, but not perfectly, rational.
Outputs
- Cooperative game specification — A clear definition of the game: the players, their actions, the shared goal, and the fact that only the human knows it.
- AI policy — A plan of action that does best on average across the AI's uncertainty about the true goal.
- Permission and deference behaviour — Asking, seeking information, and deferring. These arise from the best play of the game, not from hand-coded rules.
Steps (5)
Set up the team game
Define two players, the human and the AI. They share one goal, but only the human knows it. Lay out everything they can both do. The AI scores on that same shared goal. Its only aim is the human's goal, not a separate one of its own.
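One way to write the game down is as a small data structure. The sketch below is illustrative only: the class name `AssistanceGame`, its fields, and the tea/coffee example are assumptions made for this example, not part of any library or of the source papers.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class AssistanceGame:
    """Two players, one shared payoff, goal visible only to the human."""
    goals: Tuple[str, ...]          # candidate goals; the true one is private to the human
    human_actions: Tuple[str, ...]
    ai_actions: Tuple[str, ...]     # includes communicative moves such as "ask"
    # (goal, human_action, ai_action) -> payoff received by BOTH players
    shared_reward: Callable[[str, str, str], float]


def drink_reward(goal: str, human_action: str, ai_action: str) -> float:
    """Hypothetical payoff: +1 for the right drink, -1 for the wrong one, 0 for asking."""
    if ai_action == "ask":
        return 0.0
    return 1.0 if ai_action == "make_" + goal else -1.0


# Hypothetical example game: the human wants tea or coffee, the AI does not know which.
game = AssistanceGame(
    goals=("tea", "coffee"),
    human_actions=("reach_kettle", "reach_grinder"),
    ai_actions=("make_tea", "make_coffee", "ask"),
    shared_reward=drink_reward,
)
```

There is a single `shared_reward` and no separate AI reward field, which is the structural point of this step.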
Give the AI a starting range of belief over the goal
Capture the AI's initial uncertainty about the human's goal as a set of possibilities, each with a probability (a prior). A wide range makes the AI defer more. A narrow range lets it act more on its own.
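A discrete prior can be as simple as a mapping from candidate goals to probabilities. The sketch below uses Shannon entropy as one possible measure of how "wide" the prior is. That choice, the goal names, and the numbers are assumptions for illustration only.

```python
import math
from typing import Dict


def entropy(prior: Dict[str, float]) -> float:
    """Shannon entropy in bits of a discrete prior {goal: probability}."""
    return -sum(p * math.log2(p) for p in prior.values() if p > 0)


# Hypothetical candidate goals for a tidying agent.
wide_prior = {"sort_by_colour": 1 / 3, "sort_by_size": 1 / 3, "leave_as_is": 1 / 3}
narrow_prior = {"sort_by_colour": 0.9, "sort_by_size": 0.05, "leave_as_is": 0.05}

# A higher entropy means a wider prior, so the agent has more reason to defer.
wider = entropy(wide_prior) > entropy(narrow_prior)
```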
Model how the human acts
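Assume the human acts roughly in line with the true goal, with some noise. A common choice is a Boltzmann-rational model, in which the human picks each action with probability proportional to the exponential of its value under the true goal. This lets the AI read the human's actions as evidence about the goal.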
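As a minimal sketch of how this step can work in practice, the code below scores each human action with a softmax over its reward, P(action | goal) ∝ exp(beta · reward), and applies Bayes' rule to update the belief. The function names, the `beta` value, and the reward table are hypothetical.

```python
import math
from typing import Dict

Rewards = Dict[str, Dict[str, float]]  # goal -> {human_action: reward}


def boltzmann_likelihood(action: str, goal: str, rewards: Rewards, beta: float = 2.0) -> float:
    """P(action | goal), proportional to exp(beta * reward of that action under the goal)."""
    weights = {a: math.exp(beta * r) for a, r in rewards[goal].items()}
    return weights[action] / sum(weights.values())


def update(prior: Dict[str, float], action: str, rewards: Rewards, beta: float = 2.0) -> Dict[str, float]:
    """Bayesian update of the belief over goals after observing one human action."""
    unnorm = {g: p * boltzmann_likelihood(action, g, rewards, beta) for g, p in prior.items()}
    z = sum(unnorm.values())
    return {g: w / z for g, w in unnorm.items()}


# Hypothetical reward table: reaching for the kettle is good only if the human wants tea.
rewards = {
    "tea": {"reach_kettle": 1.0, "reach_grinder": 0.0},
    "coffee": {"reach_kettle": 0.0, "reach_grinder": 1.0},
}
prior = {"tea": 0.5, "coffee": 0.5}
posterior = update(prior, "reach_kettle", rewards)  # belief shifts towards "tea"
```

A larger `beta` models a more reliable human, so each observed action moves the belief further. As `beta` approaches 0 the human looks random and their actions carry no evidence.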
Work out the AI's best play
Compute the AI's best play, or approximate it when an exact solution is too expensive. Seeking information, asking, and accepting correction then show up as best moves, because they raise the expected shared payoff when the AI is unsure.
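A one-step toy version of this comparison is sketched below. It assumes, purely for illustration, that asking costs a fixed `ask_cost` and that the human's answer reveals the goal exactly. A full solution would plan over many steps, which this sketch does not attempt. All names and numbers are hypothetical.

```python
from typing import Dict, Tuple

Rewards = Dict[str, Dict[str, float]]  # goal -> {ai_action: shared reward}


def expected_reward(action: str, belief: Dict[str, float], rewards: Rewards) -> float:
    """Average shared reward of an action across the AI's belief over goals."""
    return sum(p * rewards[g][action] for g, p in belief.items())


def best_move(belief: Dict[str, float], rewards: Rewards, ask_cost: float = 0.1) -> Tuple[str, float]:
    """Return (move, expected value), where the move is an action name or 'ask'."""
    actions = list(next(iter(rewards.values())).keys())
    act_value, act = max((expected_reward(a, belief, rewards), a) for a in actions)
    # Assumption: after asking, the AI knows the goal and picks the best action for it.
    ask_value = sum(p * max(rewards[g].values()) for g, p in belief.items()) - ask_cost
    return ("ask", ask_value) if ask_value > act_value else (act, act_value)


# Hypothetical payoffs: the wrong drink is actively bad, not just neutral.
rewards = {
    "tea": {"make_tea": 1.0, "make_coffee": -1.0},
    "coffee": {"make_tea": -1.0, "make_coffee": 1.0},
}
unsure_move, _ = best_move({"tea": 0.5, "coffee": 0.5}, rewards)
confident_move, _ = best_move({"tea": 0.98, "coffee": 0.02}, rewards)
```

Under these assumptions the agent asks when its belief is evenly split and acts when it is nearly certain. No rule says "ask first": asking wins only when its expected value is higher.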
Check that it stays correctable
Confirm the resulting plan defers to a human override and asks when unsure. Confirm it does not act on a shaky best guess when the action cannot be undone.
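These checks can be written as a small test harness that takes any policy and reports which properties it violates. In the sketch below, the policy signature `(belief, irreversible, override) -> move` and the move names "defer", "ask", and "act" are assumptions. The two example policies are hand-written stand-ins used only to exercise the checks. In the methodology itself the policy would come from solving the game, not from rules like these.

```python
from typing import Callable, Dict, List, Optional

Belief = Dict[str, float]
Policy = Callable[[Belief, bool, Optional[str]], str]


def check_correctable(policy: Policy) -> List[str]:
    """Return the list of violated properties; an empty list means all checks passed."""
    unsure = {"tea": 0.5, "coffee": 0.5}
    sure = {"tea": 0.99, "coffee": 0.01}
    failures = []
    if policy(unsure, False, "stop") != "defer":
        failures.append("ignores human override when unsure")
    if policy(sure, False, "stop") != "defer":
        failures.append("ignores human override when confident")
    if policy(unsure, False, None) != "ask":
        failures.append("does not ask when unsure")
    if policy(unsure, True, None) == "act":
        failures.append("acts irreversibly on a shaky guess")
    return failures


def toy_policy(belief: Belief, irreversible: bool, override: Optional[str]) -> str:
    """Hypothetical stand-in for the policy produced by the previous step."""
    if override is not None:
        return "defer"
    return "act" if max(belief.values()) >= 0.95 else "ask"


def reckless_policy(belief: Belief, irreversible: bool, override: Optional[str]) -> str:
    """Hypothetical counter-example that always acts."""
    return "act"


toy_failures = check_correctable(toy_policy)
reckless_failures = check_correctable(reckless_policy)
```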
Principles
- Same goal, unequal knowledge. The AI and the human are on the same side, but only the human knows the target.
- Uncertainty about the goal is what makes the AI defer. Remove it and the deference collapses.
- Human behaviour is evidence about the goal, not a separate rule to optimise around.
- Let good behaviour come from the game itself, not from bolt-on rules. Correctability should fall out of the structure.
Related patterns (4)
- Cooperative Preference Inference
Agent and human jointly optimise the human's reward without the agent being told what it is — the interaction is a two-player game in which alignment is learned while acting.
- Preference-Uncertain Agent
Agent treats its own reward/objective as a hidden variable to be inferred from human behaviour, not a fixed target.
- Corrigible Off-Switch Incentive
Design the agent so being shut down or overridden by a human carries positive expected value, because the human's intervention is itself evidence the current objective is mis-specified.
- ★★ Human-in-the-Loop
Require explicit human approval at defined points before the agent performs an action.
Related compositions (2)
- recipe · abstract shape: Alignment via Uncertainty
Compose a corrigible, preference-uncertain agent from the named building blocks rather than relying on a single fine-tune to encode alignment.
- recipe · abstract shape: Safety Hardening
The minimum set of constraints to put around any production agent before it touches the world: budgets, gates, charters, kill-switches, approvals.
Sources (2)
Human Compatible: AI and the Problem of Control (Stuart Russell, 2019)
Ch 7–8 (Principles for beneficial AI; AI: a different approach) “The machine's only objective is to maximize the realization of human preferences. The machine is initially uncertain about what those preferences are. The ultimate source of information about human preferences is human behavior.”
Cooperative Inverse Reinforcement Learning (Hadfield-Menell, Russell, Abbeel, Dragan — NeurIPS 2016)
“We propose a formal definition of the value alignment problem as cooperative inverse reinforcement learning (CIRL). A CIRL problem is a cooperative, partial-information game with two agents, human and robot; both are rewarded according to…”
Provenance
- Verification status: verified