Preference Elicitation From Behavior Via IRL
Work out the human's goal from their behaviour using inverse RL, while keeping real uncertainty so the agent stays deferential.
Description
Work out what a human wants by watching what they do, instead of asking them to write it down. The technique is inverse reinforcement learning (IRL): infer the goal that best explains the observed behaviour. Demonstrations, choices, corrections, and refusals are the signal. The agent keeps a distribution of belief over plausible goals and updates it as evidence comes in. The key discipline is to never narrow that distribution down to a single sure answer. A single confident guess fails the same way a hard-coded reward does: the agent optimises one fixed objective and stops deferring to the human.
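As a rough illustration of the belief update described above, here is a minimal sketch over a small discrete set of candidate goals. It assumes a Boltzmann-rational choice model (the human picks better options more often, but not always), which is a common modelling choice in IRL and not something this pattern prescribes. The names `update_belief`, `utilities`, and `beta`, and the example goals and options, are hypothetical.

```python
import math

def update_belief(belief, chosen, options, utilities, beta=2.0):
    """Bayesian update of a belief over goals after one observed choice.

    belief:    dict mapping goal -> probability (sums to 1)
    chosen:    the option the human actually picked
    options:   all options that were available
    utilities: dict mapping goal -> {option: utility under that goal}
    beta:      assumed rationality; higher means less noisy choices
    """
    posterior = {}
    for goal, prior in belief.items():
        # Likelihood of the observed choice if this goal were the true one.
        weights = {o: math.exp(beta * utilities[goal][o]) for o in options}
        likelihood = weights[chosen] / sum(weights.values())
        posterior[goal] = prior * likelihood
    total = sum(posterior.values())
    return {goal: p / total for goal, p in posterior.items()}

# Hypothetical example: does the user care about speed or about cost?
utilities = {
    "fast":  {"express": 1.0, "economy": 0.0},
    "cheap": {"express": 0.0, "economy": 1.0},
}
belief = {"fast": 0.5, "cheap": 0.5}
belief = update_belief(belief, "express", ["express", "economy"], utilities)
```

One observed choice shifts belief toward "fast" but leaves real probability on "cheap". Because the likelihood is never exactly zero or one under this model, a finite amount of evidence cannot drive the belief to full certainty.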
When to apply
Use this when your users cannot state a goal in words but can show, choose, or correct. Apply it when a wrong goal is costly and you have behavioural data, such as chat logs, demonstrations, edits, and ratings. Don't apply it when the goal can be stated exactly, as in well-defined optimisation tasks. Also skip it when behavioural data is too thin to pin anything down, since the inferred belief then stays close to the starting assumptions.
What it involves
- Collect behavioural traces
- Fit a range of belief over goals using inverse RL
- Act across the full range, not on one guess
- Update on every new behavioural signal
- Ask when the uncertainty actually matters
- Watch for the range collapsing to certainty
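The last four steps above can be sketched as follows. This is an illustrative outline under stated assumptions, not a reference implementation: the decision to ask uses a simple value-of-information comparison, and collapse is detected with an entropy threshold. The function names, `ask_cost`, `min_entropy`, and the example goals are all hypothetical.

```python
import math

def expected_utility(belief, utilities, action):
    """Utility of an action averaged over every goal still considered plausible."""
    return sum(p * utilities[goal][action] for goal, p in belief.items())

def choose_or_ask(belief, utilities, actions, ask_cost=0.1):
    """Act on the whole belief, or ask when knowing the goal would pay off."""
    best = max(actions, key=lambda a: expected_utility(belief, utilities, a))
    eu_now = expected_utility(belief, utilities, best)
    # Expected utility if the true goal were revealed before acting.
    eu_informed = sum(
        p * max(utilities[goal][a] for a in actions)
        for goal, p in belief.items()
    )
    value_of_asking = eu_informed - eu_now
    if value_of_asking > ask_cost:
        return ("ask", None)
    return ("act", best)

def entropy(belief):
    """Shannon entropy in nats; zero means complete certainty."""
    return -sum(p * math.log(p) for p in belief.values() if p > 0)

def has_collapsed(belief, min_entropy=0.05):
    """Flag a belief that has become nearly certain about one goal."""
    return entropy(belief) < min_entropy

# Hypothetical example with two goals and two actions.
utilities = {
    "fast":  {"express": 1.0, "economy": 0.0},
    "cheap": {"express": 0.0, "economy": 1.0},
}
actions = ["express", "economy"]
uncertain = choose_or_ask({"fast": 0.5, "cheap": 0.5}, utilities, actions)
settled = choose_or_ask({"fast": 0.95, "cheap": 0.05}, utilities, actions)
```

With an even split the agent asks, because learning the goal is worth more than the assumed cost of interrupting. With a 95/5 split it acts, yet the belief is not flagged as collapsed, so a later correction can still move it.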