Preference Elicitation From Behavior Via IRL
also known as inverse RL preference elicitation, behaviour-grounded preference learning
Work out what a human wants by watching what they do, instead of asking them to write it down. This uses inverse reinforcement learning. Demonstrations, choices, corrections, and refusals are the signal. The agent keeps a range of belief over plausible goals and updates it as evidence comes in. The key discipline is to never narrow that range down to a single sure answer. Committing to one confident guess reproduces the failure that hard-coded rewards cause: the agent optimises a fixed target that may be wrong.
Intent. Work out the human's goal from their behaviour using inverse RL, while keeping real uncertainty so the agent stays deferential.
When to apply. Use this when users cannot state a goal in words but can show, choose, or correct, and when a wrong goal is costly and you have behavioural data such as chat, demonstrations, edits, and ratings. Don't apply it when the goal can be stated exactly, as in well-defined optimisation tasks. Also skip it when behavioural data is too thin to pin anything down: with so little evidence, the fitted range of belief just falls back to the starting assumptions.
Inputs
- Behavioural trace corpus — Demonstrations, choices, corrections, and refusals from humans using the agent or an earlier system.
- Hypothesis space of reward functions — The family of candidate goals that is assumed to contain the true goal.
- Rationality assumption — A model of how closely the human follows the goal. This ranges from a simple noisy-rational model to a richer model of how people think.
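One common concrete form of the simple noisy-rational end of that spectrum is a Boltzmann (softmax) choice model. The sketch below is illustrative only: the function name and the `beta` rationality parameter are assumptions introduced here, not something this methodology prescribes.

```python
import math


def boltzmann_choice_probs(utilities, beta=1.0):
    """Probability of each option under a noisy-rational (Boltzmann) human.

    utilities: value of each available option under ONE candidate goal.
    beta: hypothetical rationality parameter (an assumption of this sketch).
          beta = 0 means the human picks at random; larger values approach
          a human who always picks the best option.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(utilities)
    weights = [math.exp(beta * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]
```

A richer model of how people think would replace this function, but anything downstream only needs the probability it assigns to what the human actually did.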
Outputs
- Posterior over reward functions — A range of plausible goals, each with a probability, that fits the observed behaviour and has not been forced down to one answer.
- Agent policy — A plan of action that does best on average across the whole range of belief, not on a single best guess.
- Active-elicitation protocol — An optional way for the agent to ask for a demonstration or clarification that cuts its uncertainty the most.
Steps (6)
Collect behavioural traces
Gather demonstrations, choices, corrections, refusals, and rejections. Refusals and corrections tell you the most. They mark off regions the goal clearly rules out.
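A trace corpus of this kind could be represented roughly as follows. This is a minimal sketch: the `Trace` record, its field names, and the `ruled_out` helper are hypothetical and only illustrate keeping refusals and corrections as explicit, queryable evidence.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class SignalKind(Enum):
    DEMONSTRATION = "demonstration"
    CHOICE = "choice"
    CORRECTION = "correction"
    REFUSAL = "refusal"


@dataclass(frozen=True)
class Trace:
    """One observed piece of behaviour (hypothetical schema)."""
    kind: SignalKind
    options: Tuple[str, ...]          # what was on offer
    chosen: Optional[str]             # what the human picked; None for a refusal
    rejected: Tuple[str, ...] = ()    # options the human explicitly ruled out


def ruled_out(traces):
    """Collect every option that a correction or refusal explicitly rejected."""
    out = set()
    for trace in traces:
        if trace.kind in (SignalKind.CORRECTION, SignalKind.REFUSAL):
            out.update(trace.rejected)
    return out
```

Storing the rejected options separately, rather than only the final choice, keeps the negative evidence available to the fitting step.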
Fit a range of belief over goals using inverse RL
Use inverse RL, or a Bayesian version of it, to infer a range of goals that explains the traces under your chosen rationality model. Resist tuning the starting assumptions until the range collapses to one answer.
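For a small, discrete hypothesis space, the Bayesian version of this step can be sketched as below. This is a toy under stated assumptions, not a full IRL solver: goals are a finite set, each trace is a single choice among options, the human is modelled as Boltzmann-rational with a hypothetical `beta`, and the goal names and utility numbers are invented for illustration.

```python
import math


def posterior_over_goals(prior, utilities, observations, beta=1.0):
    """Return {goal: probability} after observing a list of choices.

    prior: {goal: probability}
    utilities: {goal: {option: value}}, one table per candidate goal
    observations: list of (options, chosen) pairs
    beta: hypothetical rationality parameter of the Boltzmann model
    """
    log_post = {g: math.log(p) for g, p in prior.items() if p > 0}
    for options, chosen in observations:
        for g in log_post:
            scaled = [beta * utilities[g][o] for o in options]
            top = max(scaled)
            # log of the softmax normaliser, computed stably
            log_z = top + math.log(sum(math.exp(s - top) for s in scaled))
            log_post[g] += beta * utilities[g][chosen] - log_z
    # Normalise back to probabilities.
    top = max(log_post.values())
    weights = {g: math.exp(v - top) for g, v in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# Invented example: does the user care about speed or cost?
PRIOR = {"fast": 0.5, "cheap": 0.5}
UTILITIES = {
    "fast": {"express": 1.0, "economy": 0.0},
    "cheap": {"express": 0.0, "economy": 1.0},
}
```

With two observed choices of `"economy"`, the belief moves toward `"cheap"` without reaching certainty, which is the behaviour this step is after.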
Act across the full range, not on one guess
The agent's plan should do best on average across the whole range of belief. Acting on the single most likely goal while the range is still wide is the exact failure this methodology exists to prevent.
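The difference between the two policies can be made concrete with a toy example. Everything below is a sketch under assumptions: the goal names, actions, and utility numbers are invented, and the belief is a simple dictionary over a finite set of goals.

```python
def expected_value(action, posterior, utilities):
    """Average value of an action across the whole range of belief."""
    return sum(p * utilities[goal][action] for goal, p in posterior.items())


def act_on_full_posterior(posterior, utilities, actions):
    """Pick the action that does best on average across all goals."""
    return max(actions, key=lambda a: expected_value(a, posterior, utilities))


def act_on_single_best_guess(posterior, utilities, actions):
    """Pick the best action for the most likely goal only (the failure mode)."""
    likeliest = max(posterior, key=posterior.get)
    return max(actions, key=lambda a: utilities[likeliest][a])


# Invented example: the user probably wants a tidy folder,
# but might want every draft preserved.
POSTERIOR = {"tidy": 0.6, "preserve": 0.4}
ACTIONS = ["delete_drafts", "archive_drafts"]
UTILITIES = {
    "tidy": {"delete_drafts": 1.0, "archive_drafts": 0.8},
    "preserve": {"delete_drafts": -5.0, "archive_drafts": 0.9},
}
```

Here the single-best-guess policy deletes the drafts because "tidy" is the likelier goal, while the averaging policy archives them because deletion is very costly under the goal that still holds 40% of the belief.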
Update on every new behavioural signal
Treat every correction, refusal, and choice as fresh evidence. The range of belief tightens or shifts, and the plan adjusts to match.
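An incremental update for a single new signal might look like the sketch below. It assumes a finite set of goals and that the rationality model has already supplied the probability of the signal under each goal; the function name and the example numbers are hypothetical.

```python
def update_belief(belief, likelihoods):
    """Apply Bayes' rule for one new behavioural signal.

    belief: {goal: probability} before the signal
    likelihoods: {goal: P(signal | goal)} from the rationality model
    """
    unnormalised = {g: belief[g] * likelihoods[g] for g in belief}
    total = sum(unnormalised.values())
    if total == 0:
        # No candidate goal can explain the signal at all. That points at
        # the hypothesis space or the rationality model, not at the human.
        raise ValueError("signal is impossible under every candidate goal")
    return {g: v / total for g, v in unnormalised.items()}


# Invented example: a correction that is far more likely if the user
# values brevity than if they value detail.
belief = {"brevity": 0.5, "detail": 0.5}
belief = update_belief(belief, {"brevity": 0.9, "detail": 0.2})
```

Raising an error when every goal assigns zero probability is a design choice of this sketch: it surfaces a mis-specified model instead of silently dividing by zero.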
Ask when the uncertainty actually matters
When two actions look about equally good on average but differ a lot under specific goals, the agent should ask or demonstrate rather than guess.
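One way to make this rule operational is to compare the expected value of acting now with the expected value of acting after the goal is known. The sketch below assumes, optimistically, that a question fully reveals the goal, so the result is an upper bound on what asking is worth; the `query_cost` parameter and the example numbers are hypothetical.

```python
def value_of_asking(posterior, utilities, actions):
    """Expected gain from learning the true goal before acting."""
    act_now = max(
        sum(p * utilities[g][a] for g, p in posterior.items())
        for a in actions
    )
    act_after_answer = sum(
        p * max(utilities[g][a] for a in actions)
        for g, p in posterior.items()
    )
    return act_after_answer - act_now


def should_ask(posterior, utilities, actions, query_cost):
    """Ask only when the expected gain exceeds the cost of interrupting."""
    return value_of_asking(posterior, utilities, actions) > query_cost


# Invented example: both actions average out to 0,
# but each is good under one goal and bad under the other.
POSTERIOR = {"g1": 0.5, "g2": 0.5}
ACTIONS = ["a", "b"]
SPLIT = {"g1": {"a": 1.0, "b": -1.0}, "g2": {"a": -1.0, "b": 1.0}}
# Contrast: the goals agree on every action, so asking gains nothing.
AGREE = {"g1": {"a": 1.0, "b": 0.0}, "g2": {"a": 1.0, "b": 0.0}}
```

In the `SPLIT` case the agent should ask for any modest `query_cost`; in the `AGREE` case the uncertainty does not affect the decision, so it should just act.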
Watch for the range collapsing to certainty
Check now and then how spread out the range of belief still is. If it has collapsed to near-certainty, there are two possible causes. Either the evidence is genuinely very strong, which is rare, or a wrong rationality model drove it there, which is common. Investigate before you trust it.
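Shannon entropy is one simple measure of how spread out a discrete belief still is. The check below is a sketch: the function names and both thresholds (`min_bits`, `min_signals`) are hypothetical placeholders that would need tuning for a real system.

```python
import math


def entropy_bits(posterior):
    """Shannon entropy of a discrete belief, in bits (0 means certainty)."""
    return -sum(p * math.log2(p) for p in posterior.values() if p > 0)


def collapse_warning(posterior, n_signals, min_bits=0.1, min_signals=50):
    """Flag near-certainty that was reached on suspiciously little evidence.

    min_bits and min_signals are illustrative thresholds, not recommendations.
    """
    return entropy_bits(posterior) < min_bits and n_signals < min_signals
```

A warning is a prompt to inspect the rationality model and the hypothesis space, not proof that either is wrong.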
Principles
- Behaviour is the real evidence for preferences, not what people say about themselves.
- Never let the range of belief collapse to one sure answer. A confident single guess is the failure mode.
- Corrections and refusals are first-class evidence, not noise to smooth away.
- Ask when the uncertainty matters. Just watch when it does not.
Known failure modes (3)
- ✕Reward Hacking
Collapsing the posterior to a point estimate reintroduces exactly the hard-coded reward problem IRL was supposed to avoid.
- ✕Sycophancy
Treating user agreement as preference evidence rewards the agent for telling the user what they want to hear.
- ✕Alignment Faking
IRL fitted on traces where the human knew they were being observed may produce a reward that fits behaviour under observation but not behaviour in deployment.
Related patterns (5)
- ·Preference-Uncertain Agent
Agent treats its own reward/objective as a hidden variable to be inferred from human behaviour, not a fixed target.
- ·Cooperative Preference Inference
Agent and human jointly optimise the human's reward without the agent being told what it is — the interaction is a two-player game in which alignment is learned while acting.
- ★★Human-in-the-Loop
Require explicit human approval at defined points before the agent performs an action.
- ★★Approval Queue
Queue agent-proposed actions for asynchronous human review while the agent continues other work.
- ·Corrigible Off-Switch Incentive
Design the agent so being shut down or overridden by a human carries positive expected value, because the human's intervention is itself evidence the current objective is mis-specified.
Related compositions (2)
- recipe · abstract shape · Alignment via Uncertainty
Compose a corrigible, preference-uncertain agent from the named building blocks rather than relying on a single fine-tune to encode alignment.
- recipe · abstract shape · Safety Hardening
The minimum set of constraints to put around any production agent before it touches the world: budgets, gates, charters, kill-switches, approvals.
Related methodologies (3)
- Deferential Agent Design★
Build agents whose goal is to satisfy human preferences they only partly know, not to chase a fixed proxy, so they stay deferential and correctable by default.
- Assistance Game Framing★
Frame the AI's goal as a team game with a human whose true goal the AI must work out, so that deference and asking questions arise naturally as the best play.
- Off-Switch Via Reward Uncertainty★
Make accepting shutdown the best choice on average by design, through goal uncertainty, rather than through a separate rule the agent may learn to game.
Sources (4)
Human Compatible: AI and the Problem of Control
Ch 8–9 (AI: a different approach; Complications) — third of Russell's three principles “The ultimate source of information about human preferences is human behavior.”
Wikipedia: Human Compatible — Russell's three principles
“1. The machine's only objective is to maximize the realization of human preferences. 2. The machine is initially uncertain about what those preferences are. 3. The ultimate source of information about human preferences is human behavior.”
Algorithms for Inverse Reinforcement Learning (Ng & Russell — ICML 2000)
Abstract (also indexed verbatim on ResearchGate id 2622278) “This paper addresses the problem of inverse reinforcement learning (IRL) in Markov decision processes, that is, the problem of extracting a reward function given observed, optimal behavior.”
Cooperative Inverse Reinforcement Learning (Hadfield-Menell, Russell, Abbeel, Dragan — NeurIPS 2016, generalises classical IRL)
“In contrast to classical IRL, where the human is assumed to act optimally in isolation, optimal CIRL solutions produce behaviors such as active teaching, active learning, and communicative actions that are more effective in achieving value…”
Provenance
- Added to catalog:
- Last updated:
- Verification status: verified