Methodology · Safety & Alignment · emerging · verified

Preference Elicitation From Behavior Via IRL

also known as inverse RL preference elicitation, behaviour-grounded preference learning

Applies to: agent, autonomous-agent, multi-agent-system

Tags: irl, preference-learning, behaviour-grounded, posterior-uncertainty

Work out what a human wants by watching what they do, instead of asking them to write it down. The technique is inverse reinforcement learning (IRL): demonstrations, choices, corrections, and refusals are the signal. The agent keeps a range of belief over plausible goals and updates it as evidence comes in. The key discipline is never to narrow that range down to a single certain answer, because a confident single guess reproduces the failure that hard-coded rewards cause.

Methodology process overview

Intent. Work out the human's goal from their behaviour using inverse RL, while keeping real uncertainty so the agent stays deferential.

When to apply. Use this when your users cannot state a goal in words but can show, choose, or correct. Apply it when a wrong goal is costly and you have behavioural data, such as chat, demonstrations, edits, and ratings. Don't apply it when the goal can be stated exactly, as in well-defined optimisation tasks. Also skip it when behavioural data is too thin to pin anything down, since the result just collapses back to the starting assumptions.

Inputs

  • Behavioural trace corpus: Demonstrations, choices, corrections, and refusals from humans using the agent or an earlier system.
  • Hypothesis space of reward functions: The family of possible goals the true goal might be one of.
  • Rationality assumption: A model of how closely the human follows the goal. This ranges from a simple noisy-rational model to a richer model of how people think.
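One common form of the simple noisy-rational model is the Boltzmann choice rule, where the chance of picking an option grows exponentially with its reward. The sketch below is a minimal illustration under that assumption; the function name and the `beta` parameter are hypothetical labels, not part of this methodology's definition.

```python
import math

def choice_probabilities(rewards, beta):
    """Boltzmann (noisy-rational) probabilities over a set of options.

    rewards: reward of each option under one candidate goal.
    beta: rationality coefficient (hypothetical name). 0 means the human
          chooses at random; larger values mean the human more reliably
          picks the highest-reward option.
    """
    scores = [beta * r for r in rewards]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Same rewards, two different assumptions about how closely the human follows the goal.
random_human = choice_probabilities([1.0, 0.0, 0.5], beta=0.0)
careful_human = choice_probabilities([1.0, 0.0, 0.5], beta=10.0)
```

The choice of `beta` matters: assuming the human is more reliable than they really are makes every observation count for too much.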

Outputs

  • Posterior over reward functions: A range of plausible goals, each with a probability, that fits the observed behaviour and has not been forced down to one answer.
  • Agent policy: A plan of action that does best on average across the whole range of belief, not on a single best guess.
  • Active-elicitation protocol: An optional way for the agent to ask for a demonstration or clarification that cuts its uncertainty the most.
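One way to read "cuts its uncertainty the most" is expected information gain: score each candidate question by how much it is expected to reduce the entropy of the belief, then ask the top-scoring one. The sketch below assumes a small discrete set of goals, pairwise "which do you prefer?" queries, and a Boltzmann answer model; all goal names, option names, and reward values are made up for illustration.

```python
import math

def entropy(belief):
    """Shannon entropy (in nats) of a belief over goals."""
    return -sum(p * math.log(p) for p in belief.values() if p > 0)

def answer_probs(query, rewards, beta):
    """Noisy-rational probability of each possible answer to a preference query."""
    weights = [math.exp(beta * rewards[option]) for option in query]
    total = sum(weights)
    return [w / total for w in weights]

def expected_information_gain(query, belief, reward_by_goal, beta=2.0):
    """Current entropy minus the expected entropy after hearing the answer."""
    gain = entropy(belief)
    for i in range(len(query)):
        # Joint probability of each goal together with answer i.
        joint = {g: p * answer_probs(query, reward_by_goal[g], beta)[i]
                 for g, p in belief.items()}
        p_answer = sum(joint.values())
        post = {g: j / p_answer for g, j in joint.items()}
        gain -= p_answer * entropy(post)
    return gain

def best_query(queries, belief, reward_by_goal):
    return max(queries, key=lambda q: expected_information_gain(q, belief, reward_by_goal))

# Hypothetical goals and options.
reward_by_goal = {
    "speed":   {"fast": 1.0, "careful": 0.0, "plain": 0.5, "tidy": 0.6},
    "quality": {"fast": 0.0, "careful": 1.0, "plain": 0.5, "tidy": 0.6},
}
belief = {"speed": 0.5, "quality": 0.5}
queries = [("plain", "tidy"), ("fast", "careful")]
chosen = best_query(queries, belief, reward_by_goal)
```

Here the two goals agree about `plain` versus `tidy`, so that question teaches nothing, while `fast` versus `careful` separates them.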

Steps (6)

  1. Collect behavioural traces

    Gather demonstrations, choices, corrections, and refusals. Refusals and corrections tell you the most, because they mark off regions the goal clearly rules out.

  2. Fit a range of belief over goals using inverse RL

    Use inverse RL, or a Bayesian version of it, to fit a range of goals that explain the traces under your chosen rationality model. Resist tuning the starting assumptions until the range collapses to one answer.
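A minimal sketch of step 2, assuming a tiny discrete hypothesis space and one-shot choices rather than full trajectories. This is a deliberately simplified, bandit-style stand-in for Bayesian inverse RL, not a complete IRL solver. The goal names, option names, and `beta` value are hypothetical.

```python
import math

# Hypothetical hypothesis space: each candidate goal assigns a reward to each option.
HYPOTHESES = {
    "speed":   {"fast_draft": 1.0, "careful_review": 0.0},
    "quality": {"fast_draft": 0.0, "careful_review": 1.0},
    "balance": {"fast_draft": 0.5, "careful_review": 0.5},
}

def log_likelihood(chosen, options, rewards, beta):
    """Log-probability that a noisy-rational human picks `chosen` from `options`."""
    scores = [beta * rewards[o] for o in options]
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return beta * rewards[chosen] - log_norm

def posterior(traces, prior, beta=2.0):
    """Bayes rule over the discrete hypothesis space, computed in log space."""
    log_post = {
        name: math.log(prior[name]) + sum(
            log_likelihood(chosen, options, rewards, beta)
            for chosen, options in traces
        )
        for name, rewards in HYPOTHESES.items()
    }
    top = max(log_post.values())
    unnorm = {h: math.exp(v - top) for h, v in log_post.items()}
    total = sum(unnorm.values())
    return {h: w / total for h, w in unnorm.items()}

options = ["fast_draft", "careful_review"]
# Three observed choices of careful_review and one of fast_draft.
traces = [("careful_review", options)] * 3 + [("fast_draft", options)]
uniform_prior = {h: 1.0 / len(HYPOTHESES) for h in HYPOTHESES}
belief = posterior(traces, uniform_prior)
```

With this data the belief leans towards `quality` but still leaves substantial weight on `balance`, which is the intended outcome: a range, not a single answer.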

  3. Act across the full range, not on one guess

    The agent's plan should do best on average across the whole range of belief. Acting on the single most likely goal while the range is still wide is the exact failure this methodology exists to prevent.

    uses: Preference-Uncertain Agent
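The contrast in step 3 can be sketched with two action-selection rules: one that averages over the belief, and one that commits to the single most likely goal. The belief, action names, and reward numbers below are invented for illustration.

```python
# Hypothetical belief over goals and reward of each action under each goal.
belief = {"wants_speed": 0.55, "wants_safety": 0.45}
reward = {
    "ship_now":         {"wants_speed": 1.0, "wants_safety": -2.0},
    "ship_with_review": {"wants_speed": 0.6, "wants_safety": 0.8},
}

def expected_reward(action, belief):
    """Average reward of an action, weighted by the belief over goals."""
    return sum(p * reward[action][goal] for goal, p in belief.items())

def best_action_across_belief(belief):
    """Pick the action that does best on average across the whole belief."""
    return max(reward, key=lambda a: expected_reward(a, belief))

def best_action_for_most_likely_goal(belief):
    """Pick the action that is best under the single most likely goal only."""
    likely_goal = max(belief, key=belief.get)
    return max(reward, key=lambda a: reward[a][likely_goal])

averaged_choice = best_action_across_belief(belief)
single_guess_choice = best_action_for_most_likely_goal(belief)
```

In this example the single-guess rule picks `ship_now`, which is costly if the less likely goal turns out to be the true one, while the averaged rule picks `ship_with_review`.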

  4. Update on every new behavioural signal

    Treat every correction, refusal, and choice as fresh evidence. The range of belief tightens or shifts, and the plan adjusts to match.

    uses: Cooperative Preference Inference
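Step 4 can be sketched as an incremental Bayesian update that treats each signal as "the human preferred one option over another". Modelling a refusal as a preference for a zero-reward `no_action` baseline is an assumption made here for illustration, as are the goal names, reward values, and the logistic likelihood with coefficient `beta`.

```python
import math

def update(belief, rewards_by_goal, preferred, rejected, beta=2.0):
    """One Bayesian update from a single signal: `preferred` was chosen over `rejected`."""
    unnorm = {}
    for goal, p in belief.items():
        r = rewards_by_goal[goal]
        # Two-option Boltzmann (logistic) likelihood of this preference under the goal.
        likelihood = 1.0 / (1.0 + math.exp(-beta * (r[preferred] - r[rejected])))
        unnorm[goal] = p * likelihood
    total = sum(unnorm.values())
    return {g: w / total for g, w in unnorm.items()}

# Hypothetical goals; "no_action" is an assumed baseline used to encode refusals.
rewards_by_goal = {
    "concise":  {"short_reply": 1.0, "long_reply": -0.5, "no_action": 0.0},
    "thorough": {"short_reply": 0.2, "long_reply": 1.0,  "no_action": 0.0},
}
belief = {"concise": 0.5, "thorough": 0.5}

# Correction: the human edited the agent's long reply down to a short one.
after_correction = update(belief, rewards_by_goal, "short_reply", "long_reply")
# Refusal: the human later rejected a long reply outright.
after_refusal = update(after_correction, rewards_by_goal, "no_action", "long_reply")
```

Each call returns a new belief, so the policy can be recomputed after every signal rather than in batches.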

  5. Ask when the uncertainty actually matters

    When two actions look about equally good on average but differ a lot under specific goals, the agent should ask for a clarification or a demonstration rather than guess.

    uses: Human-in-the-Loop, Approval Queue
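One way to make "when the uncertainty actually matters" concrete is the value of perfect information: how much better the agent would do, on average, if it learned the goal before acting. The sketch below assumes a discrete belief and a fixed, hypothetical `ask_cost`; the reward tables are illustrative only.

```python
def expected_reward(action, belief, reward):
    return sum(p * reward[action][goal] for goal, p in belief.items())

def value_of_asking(belief, reward):
    """Expected gain from learning the true goal before choosing an action."""
    act_now = max(expected_reward(a, belief, reward) for a in reward)
    act_informed = sum(
        p * max(reward[a][goal] for a in reward) for goal, p in belief.items()
    )
    return act_informed - act_now

def should_ask(belief, reward, ask_cost):
    """Ask only when the expected gain exceeds the cost of interrupting the human."""
    return value_of_asking(belief, reward) > ask_cost

belief = {"goal_a": 0.5, "goal_b": 0.5}

# Equal on average, but each action is bad under one of the goals.
high_stakes = {
    "action_x": {"goal_a": 1.0, "goal_b": -1.0},
    "action_y": {"goal_a": -1.0, "goal_b": 1.0},
}
# Equal on average, and nearly equal under every goal.
low_stakes = {
    "action_x": {"goal_a": 0.5, "goal_b": 0.4},
    "action_y": {"goal_a": 0.4, "goal_b": 0.5},
}

ask_high = should_ask(belief, high_stakes, ask_cost=0.1)
ask_low = should_ask(belief, low_stakes, ask_cost=0.1)
```

Both tables have actions that tie on average, yet only the first justifies a question, because only there does the answer change the outcome by much.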

  6. Watch for the range collapsing to certainty

    Check periodically how spread out the range of belief still is. If it has collapsed to near-certainty, there are two possible causes: either the evidence is genuinely very strong, which is rare, or a wrong rationality model drove it there, which is common. Investigate before you trust it.
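A simple way to measure how spread out the belief is, assuming a discrete set of at least two goals, is Shannon entropy compared against its maximum (the uniform case). The threshold `min_fraction` below is a hypothetical tuning knob, not a recommended value.

```python
import math

def entropy_bits(belief):
    """Shannon entropy of the belief, in bits."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def collapse_warning(belief, min_fraction=0.1):
    """Flag a belief whose entropy has fallen below a fraction of the uniform maximum.

    Assumes the belief covers at least two goals.
    """
    max_bits = math.log2(len(belief))
    return entropy_bits(belief) < min_fraction * max_bits

# Hypothetical snapshots of the belief at two points in time.
wide = {"goal_a": 0.5, "goal_b": 0.3, "goal_c": 0.2}
narrow = {"goal_a": 0.995, "goal_b": 0.004, "goal_c": 0.001}

wide_flagged = collapse_warning(wide)
narrow_flagged = collapse_warning(narrow)
```

A warning does not say which of the two causes applies; it only marks the point where the rationality model and the evidence deserve a manual look.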

Principles

  • Behaviour is the real evidence for preferences, not what people say about themselves.
  • Never let the range of belief collapse to one sure answer. A confident single guess is the failure mode.
  • Corrections and refusals are first-class evidence, not noise to smooth away.
  • Ask when the uncertainty matters. Just watch when it does not.

Provenance

  • Added to catalog:
  • Last updated:
  • Verification status: verified