Methodology · Safety & Alignment · emerging · verified

Preference Elicitation From Behavior Via IRL

Also known as: inverse RL preference elicitation, behaviour-grounded preference learning

Applies to: agent, autonomous-agent, multi-agent-system

Tags: irl, preference-learning, behaviour-grounded, posterior-uncertainty

Work out what a human wants by watching what they do, instead of asking them to write it down. This uses inverse reinforcement learning. Demonstrations, choices, corrections, and refusals are the signal. The agent keeps a range of belief over plausible goals and updates it as evidence comes in. The key discipline is to never narrow that range down to a single sure answer. A confident single guess is the same failure that hard-coded rewards cause.

Methodology process overview

flowchart LR
    traces["Behavioural traces:\ndemos, choices, refusals"] --> fit["Fit posterior P(R)\nvia Bayesian IRL"]
    rat["Rationality model:\nBoltzmann"] --> fit
    hyp["Reward hypothesis space"] --> fit
    fit --> post["Posterior over R"]
    post --> pol["Policy = argmax E[R | posterior]"]
    pol --> act["Agent action"]
    act --> sig["New behavioural signal"]
    sig -->|update| fit
    post --> ent{"Posterior entropy"}
    ent -->|collapsed| audit["Audit: bug or genuine?"]
    ent -->|broad and close call| ask["Active elicitation"]
    ask --> sig
    audit --> fit

Intent. Work out the human's goal from their behaviour using inverse RL, while keeping real uncertainty so the agent stays deferential.

When to apply. Use this when your users cannot state a goal in words but can show, choose, or correct. Apply it when a wrong goal is costly and you have behavioural data, such as chat, demonstrations, edits, and ratings. Don't apply it when the goal can be stated exactly, as in well-defined optimisation tasks. Also skip it when behavioural data is too thin to pin anything down, since the result just collapses back to the starting assumptions.

Example scenario

A research team building an email-assistant agent tests whether inverse RL can recover a user's preferences from editing behaviour, rather than asking the user to write a rubric. The corpus is 12,000 traces from a single power user. It contains drafts the assistant proposed, the edits the user made, the drafts the user rejected outright, and the drafts the user sent unchanged.

The team fits a Bayesian inverse RL model over a set of possible goals. These include 'prefers concise replies', 'prefers warm tone', 'prefers no exclamation marks', and combinations of them. They use a Boltzmann-rational model of the user. The range of belief stays broad after a few hundred traces, since several goals fit the data. The agent's behaviour reflects that. It asks the user to confirm tone on borderline drafts. It acts on its own only when one choice is clearly better across the whole range.

After two more weeks of traces, a refusal pattern emerges that sharply rules out 'prefers warm tone'. The range shifts toward 'concise plus neutral'. The team watches how spread out the range is and notices it collapse after a faulty rationality update. They investigate, find the bug, widen the rationality assumption, and re-fit. They report honestly that the approach works in this single-user research setting. It has not been tested across many users or run on production traffic. Inverse RL on observed behaviour stays a partial solution to the elicitation problem.

Inputs

Behavioural trace corpus — Demonstrations, choices, corrections, and refusals from humans using the agent or an earlier system.
Hypothesis space of reward functions — The family of possible goals the true goal might be one of.
Rationality assumption — A model of how closely the human follows the goal. This ranges from a simple noisy-rational model to a richer model of how people think.
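One common way to write down how these three inputs combine is the Boltzmann-rational likelihood used in Bayesian IRL. The symbols here are assumptions introduced for illustration: D is the trace corpus, R a candidate reward function from the hypothesis space, Q_R(s, a) the value of action a in situation s under R, and β a rationality parameter (large β means near-optimal behaviour, β near zero means near-random behaviour).

```latex
P(a \mid s, R) = \frac{\exp\big(\beta\, Q_R(s, a)\big)}{\sum_{a'} \exp\big(\beta\, Q_R(s, a')\big)},
\qquad
P(R \mid D) \;\propto\; P(R) \prod_{(s, a) \in D} P(a \mid s, R)
```

A richer model of how people think would replace the first expression and leave the second unchanged.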

Outputs

Posterior over reward functions — A range of plausible goals, each with a probability, that fits the observed behaviour and has not been forced down to one answer.
Agent policy — A plan of action that does best on average across the whole range of belief, not on a single best guess.
Active-elicitation protocol — An optional way for the agent to ask for a demonstration or clarification that cuts its uncertainty the most.

Steps (6)

Collect behavioural traces
Gather demonstrations, choices, corrections, refusals, and rejections. Refusals and corrections tell you the most. They mark off regions the goal clearly rules out.
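A trace record might look like the sketch below. The field names, the `SignalKind` enum, and the `rules_out_proposal` helper are hypothetical, chosen only to show how refusals and corrections can be kept as distinct, first-class signals rather than filtered out.

```python
from dataclasses import dataclass
from enum import Enum


class SignalKind(Enum):
    DEMONSTRATION = "demonstration"
    CHOICE = "choice"
    CORRECTION = "correction"
    REFUSAL = "refusal"


@dataclass(frozen=True)
class Trace:
    kind: SignalKind
    context: str   # what the agent was asked to do
    proposed: str  # what the agent proposed ("" for a pure demonstration)
    observed: str  # what the human did instead ("" for an outright refusal)


def rules_out_proposal(trace):
    """Corrections and refusals mark the proposal as something the goal excludes."""
    return trace.kind in (SignalKind.CORRECTION, SignalKind.REFUSAL)


# Illustrative corpus: one accepted draft and one refused draft.
corpus = [
    Trace(SignalKind.CHOICE, "reply to vendor", "Thanks, will do.", "Thanks, will do."),
    Trace(SignalKind.REFUSAL, "reply to vendor", "So great to hear from you!!", ""),
]
negative_evidence = [t for t in corpus if rules_out_proposal(t)]
print(len(negative_evidence))
```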
Fit a range of belief over goals using inverse RL
Use inverse RL, or a Bayesian version of it, to fit a range of goals that fit the traces under your chosen rationality model. Resist tuning the starting assumptions until the range collapses to one answer.
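A minimal sketch of this fitting step, assuming a small discrete hypothesis space and a Boltzmann-rational user who picks one draft from a set of options. The names `DRAFTS`, `utility`, `boltzmann_likelihood`, `fit_posterior`, and the value of `beta` are illustrative assumptions. A real system would likely need a richer reward family and approximate inference.

```python
import math

# Hypothetical draft features: how well each draft scores under each goal.
DRAFTS = {
    "short_neutral": {"concise": 1.0, "warm": 0.0},
    "short_warm": {"concise": 1.0, "warm": 1.0},
    "long_warm": {"concise": 0.0, "warm": 1.0},
}


def utility(hypothesis, draft):
    return DRAFTS[draft][hypothesis]


def boltzmann_likelihood(chosen, utilities, beta):
    """P(chosen | hypothesis) for a noisy-rational user; beta is an assumed parameter."""
    top = max(utilities.values())  # subtract the max for numerical stability
    weights = {a: math.exp(beta * (u - top)) for a, u in utilities.items()}
    return weights[chosen] / sum(weights.values())


def fit_posterior(prior, traces, beta=2.0):
    """prior: {hypothesis: prob > 0}; traces: list of (options, chosen)."""
    log_post = {h: math.log(p) for h, p in prior.items()}
    for options, chosen in traces:
        for h in log_post:
            utils = {a: utility(h, a) for a in options}
            log_post[h] += math.log(boltzmann_likelihood(chosen, utils, beta))
    top = max(log_post.values())
    unnorm = {h: math.exp(lp - top) for h, lp in log_post.items()}
    total = sum(unnorm.values())
    return {h: w / total for h, w in unnorm.items()}


traces = [
    (("short_neutral", "long_warm"), "short_neutral"),
    (("short_warm", "long_warm"), "short_warm"),
]
posterior = fit_posterior({"concise": 0.5, "warm": 0.5}, traces)
print(posterior)
```

With only two traces the belief leans toward 'concise' but keeps real mass on 'warm'. Lowering `beta` widens the rationality assumption and slows how fast the range narrows.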
Act across the full range, not on one guess
The agent's plan should do best on average across the whole range of belief. Acting on the single most likely goal while the range is still wide is the exact failure this methodology exists to prevent.
Uses: Preference-Uncertain Agent
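The difference between acting across the range and acting on the single most likely goal can be shown with a toy case. The `UTILITY` table, the action names, and both helper functions are hypothetical, and the numbers are made up for illustration.

```python
# Hypothetical utilities: UTILITY[hypothesis][action].
UTILITY = {
    "concise": {"terse_reply": 1.0, "balanced_reply": 0.8},
    "warm": {"terse_reply": -2.0, "balanced_reply": 0.6},
}


def expected_utility(action, posterior):
    return sum(p * UTILITY[h][action] for h, p in posterior.items())


def act_on_posterior(actions, posterior):
    """Pick the action that does best on average across the whole range of belief."""
    return max(actions, key=lambda a: expected_utility(a, posterior))


def act_on_map(actions, posterior):
    """The failure mode: commit to the single most likely goal and optimise for it."""
    h_map = max(posterior, key=posterior.get)
    return max(actions, key=lambda a: UTILITY[h_map][a])


posterior = {"concise": 0.55, "warm": 0.45}
actions = ["terse_reply", "balanced_reply"]
print(act_on_map(actions, posterior))
print(act_on_posterior(actions, posterior))
```

Here the single-guess rule picks `terse_reply`, which is slightly better if the user wants concision and much worse if they want warmth. The expected-utility rule picks `balanced_reply`, which is acceptable under both goals.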
Update on every new behavioural signal
Treat every correction, refusal, and choice as fresh evidence. The range of belief tightens or shifts, and the plan adjusts to match.
Uses: Cooperative Preference Inference
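A sketch of a single-signal update, assuming a refusal is modelled as a Boltzmann choice between sending a draft and rejecting it, with rejection worth 0. The function names, the `WARM_DRAFT_UTILITY` values, and `beta` are illustrative assumptions.

```python
import math


def p_reject(draft_utility, beta=2.0):
    """Assumed model: P(reject) = 1 / (1 + exp(beta * utility of sending the draft))."""
    return 1.0 / (1.0 + math.exp(beta * draft_utility))


def update(belief, likelihood):
    """One Bayes step: new belief is proportional to old belief times likelihood."""
    unnorm = {h: belief[h] * likelihood[h] for h in belief}
    total = sum(unnorm.values())
    return {h: w / total for h, w in unnorm.items()}


# Hypothetical utility of a warm-toned draft under each candidate goal.
WARM_DRAFT_UTILITY = {"concise_neutral": -1.0, "warm": 1.0}

belief = {"concise_neutral": 0.4, "warm": 0.6}
history = [belief]
for _ in range(3):  # the user refuses three warm drafts in a row
    likelihood = {h: p_reject(u) for h, u in WARM_DRAFT_UTILITY.items()}
    belief = update(belief, likelihood)
    history.append(belief)
print(history)
```

Each refusal shifts weight away from 'warm', and a short run of them is enough to move most of the belief to 'concise_neutral' under these assumed numbers.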
Ask when the uncertainty actually matters
When two actions look about equally good on average but differ a lot under specific goals, the agent should ask or demonstrate rather than guess.
Uses: Human-in-the-Loop Approval Queue
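One possible way to decide when the uncertainty matters is to compare the expected value of perfect information against the cost of interrupting the user. This is a swapped-in criterion for illustration, not necessarily the one the methodology's sources use. The function names, the utility tables, and `query_cost` are all assumptions.

```python
def expected_utility(action, posterior, utility):
    return sum(p * utility[h][action] for h, p in posterior.items())


def value_of_asking(actions, posterior, utility):
    """Expected gain from learning the true goal before acting."""
    best_now = max(expected_utility(a, posterior, utility) for a in actions)
    best_if_known = sum(
        p * max(utility[h][a] for a in actions) for h, p in posterior.items()
    )
    return best_if_known - best_now


def should_ask(actions, posterior, utility, query_cost=0.1):
    return value_of_asking(actions, posterior, utility) > query_cost


posterior = {"concise": 0.5, "warm": 0.5}
actions = ["terse_draft", "warm_draft"]

# Equal on average, very different under each specific goal: worth asking.
high_stakes = {
    "concise": {"terse_draft": 1.0, "warm_draft": -1.0},
    "warm": {"terse_draft": -1.0, "warm_draft": 1.0},
}
# Equal on average and nearly equal under each goal: just act.
low_stakes = {
    "concise": {"terse_draft": 0.55, "warm_draft": 0.5},
    "warm": {"terse_draft": 0.45, "warm_draft": 0.5},
}
print(should_ask(actions, posterior, high_stakes))
print(should_ask(actions, posterior, low_stakes))
```

Both cases have the same belief and tied expected utilities. Only the first is worth a question, because the choice differs a lot depending on which goal is true.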
Watch for the range collapsing to certainty
Check periodically how spread out the range of belief still is. If it has collapsed to near-certainty, there are two possible causes: either the evidence is genuinely very strong, which is rare, or a wrong rationality model drove it there, which is common. Investigate before you trust it.
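A simple way to measure how spread out the belief is uses Shannon entropy. The sketch below assumes a discrete posterior. The `min_bits` threshold and the function names are illustrative, and the threshold would need tuning for a real hypothesis space.

```python
import math


def entropy_bits(posterior):
    """Shannon entropy of a discrete belief, in bits; 0 means total certainty."""
    return -sum(p * math.log2(p) for p in posterior.values() if p > 0)


def needs_audit(posterior, min_bits=0.1):
    """Flag a collapsed belief for investigation; min_bits is an assumed threshold."""
    return entropy_bits(posterior) < min_bits


broad = {"concise": 0.25, "warm": 0.25, "no_exclaim": 0.25, "concise_neutral": 0.25}
collapsed = {"concise": 0.001, "warm": 0.001, "no_exclaim": 0.001, "concise_neutral": 0.997}

print(entropy_bits(broad), needs_audit(broad))
print(entropy_bits(collapsed), needs_audit(collapsed))
```

A flag from `needs_audit` does not say which cause applies. It only marks the point where someone should check the rationality model before the agent acts on the belief.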

Principles

Behaviour is the real evidence for preferences, not what people say about themselves.
Never let the range of belief collapse to one sure answer. A confident single guess is the failure mode.
Corrections and refusals are first-class evidence, not noise to smooth away.
Ask when the uncertainty matters. Just watch when it does not.
