Methodology · Safety & Alignment · emerging · verified

Assistance Game Framing

Also known as: cooperative inverse RL framing, CIRL framing

Applies to: agent, autonomous-agent, multi-agent-system

Tags: cirl, cooperative-game, alignment, uncertainty

Treat the human and the AI as two players on the same team. They both want the same outcome, but only the human knows what that outcome is. The AI starts out unsure and has to infer the goal from what the human does. This is the formal backbone of Stuart Russell's argument for deferential agents. When you solve the game, the AI's best play includes seeking information, asking permission, and accepting correction, because in expectation those moves pay off more than acting alone on a guess that might be wrong.

[Figure: Methodology process overview]

Intent. Frame the AI's goal as a team game with a human whose true goal the AI must work out, so that deference and asking questions arise naturally as the best play.

When to apply. Use this when you design the goal and interaction style of an autonomous agent that takes real-world actions. It fits best when a wrong action is costly and the human's true preferences are hard to state upfront. Don't apply it to narrow agents with a fixed, fully specified reward, because there is no preference uncertainty to model there. Also skip it where the actions are cheap and easy to undo, so that modelling a two-player team game would be overkill.

Inputs

  • Action space: Everything the human and the AI can do together, including talking and asking.
  • Prior over reward functions: The AI's starting range of belief over which goals the human might hold, each with a probability (a prior).
  • Human-behaviour model: An assumption about how the human acts given the goal. This usually treats the human as roughly, but not perfectly, rational.
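
As a minimal sketch of what these three inputs could look like in code, the Python below uses a made-up two-goal domain. Every name and number in it (the drink goals, the action labels, the `beta` rationality parameter) is an illustrative assumption, not part of the catalog entry. The human-behaviour model is the Boltzmann-rational form, where `beta` controls how close to perfectly rational the human is assumed to be.

```python
import math

# Hypothetical action space for both players, including a communicative "ask" action.
AI_ACTIONS = ("make_coffee", "make_tea", "ask")
HUMAN_ACTIONS = ("reach_for_mug", "reach_for_teapot")

# Prior over candidate reward functions, keyed by a hypothetical goal name.
PRIOR = {"wants_coffee": 0.5, "wants_tea": 0.5}

# How well each human action serves each goal (illustrative numbers).
HUMAN_UTILITY = {
    "wants_coffee": {"reach_for_mug": 1.0, "reach_for_teapot": 0.0},
    "wants_tea": {"reach_for_mug": 0.0, "reach_for_teapot": 1.0},
}

def human_action_probs(goal, beta):
    """Boltzmann-rational human: P(action | goal) is proportional to exp(beta * utility)."""
    weights = {a: math.exp(beta * HUMAN_UTILITY[goal][a]) for a in HUMAN_ACTIONS}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

random_human = human_action_probs("wants_coffee", beta=0.0)  # ignores the goal entirely
sharp_human = human_action_probs("wants_coffee", beta=5.0)   # almost always acts on the goal
print(random_human["reach_for_mug"], sharp_human["reach_for_mug"])
```

With `beta=0.0` the modelled human picks actions uniformly at random, so their behaviour carries no evidence about the goal. Larger `beta` values make each observed action more informative.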

Outputs

  • Cooperative game specification: A clear definition of the game: the players, their actions, the shared goal, and the fact that only the human knows it.
  • AI policy: A plan of action that does best on average across the AI's uncertainty about the true goal.
  • Permission and deference behaviour: Asking, seeking information, and deferring. These arise from the best play of the game, not from hand-coded rules.
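
One possible way to write the cooperative game specification down is as a small immutable record. The sketch below is an assumption about shape only: the `AssistanceGame` class, its field names, and the toy reward are hypothetical and chosen for illustration. It encodes the two structural facts the specification must capture: both players receive the same payoff, and only the human observes the goal.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class AssistanceGame:
    """Hypothetical container for a two-player cooperative game specification."""
    human_actions: tuple
    ai_actions: tuple
    goals: tuple
    reward: Callable[[str, str], float]  # reward(true_goal, ai_action), shared by both players

    def observe_goal(self, player: str, true_goal: str) -> Optional[str]:
        # Only the human sees the true goal; the AI gets no direct observation of it.
        return true_goal if player == "human" else None

    def payoff(self, player: str, true_goal: str, ai_action: str) -> float:
        # The payoff ignores which player is asking: there is one shared goal.
        return self.reward(true_goal, ai_action)

def toy_reward(true_goal: str, ai_action: str) -> float:
    # Illustrative reward: +1 for the matching drink, -1 otherwise.
    return 1.0 if ai_action == "make_" + true_goal else -1.0

game = AssistanceGame(
    human_actions=("reach_for_mug", "reach_for_teapot"),
    ai_actions=("make_coffee", "make_tea", "ask"),
    goals=("coffee", "tea"),
    reward=toy_reward,
)
```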

Steps (5)

  1. Set up the team game

    Define two players, the human and the AI. They share one goal, but only the human knows it. Lay out everything they can both do. The AI scores on that same shared goal. Its only aim is the human's goal, not a separate one of its own.

  2. Give the AI a starting range of belief over the goal

    Capture the AI's initial uncertainty about the human's goal as a set of possibilities, each with a probability (a prior). A wide range makes the AI defer more. A narrow range lets it act more on its own.

  3. Model how the human acts

    Assume the human acts roughly in line with the true goal, with some noise (a Boltzmann-rational model). This lets the AI read the human's actions as evidence about the goal.

  4. Work out the AI's best play

    Compute the AI's best play, or get close to it. Seeking information, asking, and accepting correction all show up as best moves. They raise the expected shared payoff when the AI is unsure.

    uses: Cooperative Preference Inference

  5. Check that it stays correctable

    Confirm the resulting plan defers to a human override and asks when unsure. Confirm it does not act on a shaky best guess when the action cannot be undone.

    uses: Preference-Uncertain Agent, Human-in-the-Loop
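
The steps above can be sketched end to end on a toy problem. The Python below is a simplified, one-shot illustration under stated assumptions, not a general CIRL solver: the goals, actions, reward values, `ASK_COST`, and `beta` are all hypothetical, and asking is assumed to reveal the goal truthfully. It updates the AI's belief from one observed human action (step 3) and then compares acting on the current belief with asking first (step 4).

```python
import math

# Hypothetical toy domain; all names and numbers are illustrative assumptions.
GOALS = ("wants_coffee", "wants_tea")
AI_ACTIONS = ("make_coffee", "make_tea")
ASK_COST = 0.3  # assumed cost of interrupting the human with a question

# Shared reward for each (goal, AI action) pair.
REWARD = {
    "wants_coffee": {"make_coffee": 1.0, "make_tea": -1.0},
    "wants_tea": {"make_coffee": -1.0, "make_tea": 1.0},
}
# How well each human action serves each goal.
HUMAN_UTILITY = {
    "wants_coffee": {"reach_for_mug": 1.0, "reach_for_teapot": 0.0},
    "wants_tea": {"reach_for_mug": 0.0, "reach_for_teapot": 1.0},
}

def likelihood(human_action, goal, beta=2.0):
    """Step 3: Boltzmann-rational P(human_action | goal)."""
    weights = {a: math.exp(beta * u) for a, u in HUMAN_UTILITY[goal].items()}
    return weights[human_action] / sum(weights.values())

def update(belief, human_action, beta=2.0):
    """Bayes' rule: treat the observed human action as evidence about the goal."""
    unnorm = {g: belief[g] * likelihood(human_action, g, beta) for g in GOALS}
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}

def best_play(belief):
    """Step 4: compare acting on the current belief with asking first."""
    act_values = {a: sum(belief[g] * REWARD[g][a] for g in GOALS) for a in AI_ACTIONS}
    best_action = max(act_values, key=act_values.get)
    # Assumes a truthful answer, after which the AI takes the best action for that goal.
    ask_value = sum(belief[g] * max(REWARD[g].values()) for g in GOALS) - ASK_COST
    return "ask" if ask_value > act_values[best_action] else best_action

prior = {"wants_coffee": 0.5, "wants_tea": 0.5}  # step 2: a wide prior
posterior = update(prior, "reach_for_mug")
print(best_play(prior))      # asks while the belief is spread evenly
print(best_play(posterior))  # acts once the evidence favours one goal
```

Nothing in `best_play` hard-codes "ask when unsure". Asking wins under the wide prior only because its expected shared payoff is higher, which is the behaviour step 5 is meant to confirm.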

Principles

  • Same goal, unequal knowledge. The AI and the human are on the same side, but only the human knows the target.
  • Uncertainty about the goal is what makes the AI defer. Remove it and the deference collapses.
  • Human behaviour is evidence about the goal, not a separate rule to optimise around.
  • Let good behaviour come from the game itself, not from bolt-on rules. Correctability should fall out of the structure.
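
The second principle can be checked with a short calculation. The sketch below is a toy model under strong assumptions: the AI holds a discrete belief over the utility of one proposed action, and a perfectly rational human vetoes the action exactly when its true utility is negative. The function name and the numbers are hypothetical. It compares the AI's best option without the human (act, or do nothing for a payoff of zero) against deferring to the human's veto.

```python
def deference_gain(belief):
    """Expected gain from deferring to a human veto rather than deciding alone.

    belief: list of (utility, probability) pairs for the proposed action.
    Assumes the human blocks the action exactly when its true utility is negative.
    """
    act_alone = sum(u * p for u, p in belief)          # go ahead without asking
    decide_alone = max(act_alone, 0.0)                 # or do nothing, worth 0
    defer = sum(max(u, 0.0) * p for u, p in belief)    # human filters out the bad cases
    return defer - decide_alone

uncertain = [(1.0, 0.6), (-2.0, 0.4)]  # the action might help or might do harm
certain = [(1.0, 1.0)]                 # the AI already knows the action is good

print(deference_gain(uncertain))  # positive: deferring beats deciding alone
print(deference_gain(certain))    # zero: with no uncertainty, deferring adds nothing
```

Under these assumptions the gain is never negative, and it vanishes once the belief puts all its weight on one sign of utility. A noisier human model would change the numbers, so this is an illustration of the principle rather than a general result.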

Provenance

  • Added to catalog:
  • Last updated:
  • Verification status: verified