Methodology · Safety & Alignment

Preference Elicitation From Behavior Via IRL

Work out the human's goal from their behaviour using inverse RL, while keeping real uncertainty so the agent stays deferential.

Description

Work out what a human wants by watching what they do, instead of asking them to write it down. The method is inverse reinforcement learning (IRL): demonstrations, choices, corrections, and refusals are the signal. The agent keeps a range of belief (a probability distribution) over plausible goals and updates it as evidence comes in. The key discipline is never to narrow that range to a single certain answer. An agent that acts on one confident guess fails the same way as one with a hard-coded reward: it optimises a fixed objective and has no reason to defer.
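As a rough illustration of keeping and updating a range of belief, the sketch below performs one Bayesian update over a small set of candidate goals after observing a single choice. The source does not name a specific IRL algorithm, so the noisily-rational (Boltzmann) choice model, the `beta` parameter, the three candidate goals, and the option features are all assumptions made for the example.

```python
import math

def choice_likelihood(chosen, options, reward, beta=1.0):
    """P(human picks `chosen` from `options`) under an assumed Boltzmann-rational model.

    Higher `beta` means the human is assumed to be closer to perfectly rational.
    """
    weights = [math.exp(beta * reward(o)) for o in options]
    return math.exp(beta * reward(chosen)) / sum(weights)

def update_belief(belief, goals, chosen, options, beta=1.0):
    """One Bayesian step: posterior is proportional to prior times likelihood."""
    unnormalised = {
        name: prior * choice_likelihood(chosen, options, goals[name], beta)
        for name, prior in belief.items()
    }
    total = sum(unnormalised.values())
    return {name: weight / total for name, weight in unnormalised.items()}

# Hypothetical candidate goals: each scores an option described by two features.
goals = {
    "fast": lambda o: -o["minutes"],
    "cheap": lambda o: -o["cost"],
    "balanced": lambda o: -0.5 * o["minutes"] - 0.5 * o["cost"],
}

# Start from a uniform belief, then observe the human pick the quick, pricey option.
belief = {name: 1 / len(goals) for name in goals}
options = [
    {"minutes": 1, "cost": 5},
    {"minutes": 5, "cost": 1},
    {"minutes": 3, "cost": 3},
]
belief = update_belief(belief, goals, chosen=options[0], options=options)
```

After this one observation the belief leans toward "fast", but "balanced" keeps real weight because it rates all three options equally and so cannot be ruled out. That residual uncertainty is what the pattern asks the agent to preserve.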

When to apply

Use this when your users cannot state a goal in words but can show, choose, or correct. Apply it when acting on a wrong goal is costly and you have behavioural data, such as chat, demonstrations, edits, and ratings. Don't apply it when the goal can be stated exactly, as in well-defined optimisation tasks. Also skip it when behavioural data is too thin to pin anything down, since the belief then simply falls back to its starting assumptions (the prior).

What it involves

Collect behavioural traces
Fit a range of belief over goals using inverse RL
Act across the full range, not on one guess
Update on every new behavioural signal
Ask when the uncertainty actually matters
Watch for the range collapsing to certainty
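The last four steps can be sketched as follows, assuming a belief over goals has already been fitted. The goal names, reward tables, `ask_cost`, and `min_entropy` threshold are hypothetical values chosen for illustration, and the "ask" rule shown here (compare the expected gain from knowing the goal against a fixed cost of interrupting) is one simple option rather than the method prescribed by the source.

```python
import math

def expected_value(action, belief, goals):
    """Reward of `action` averaged over every goal, weighted by belief."""
    return sum(p * goals[g][action] for g, p in belief.items())

def choose_or_ask(actions, belief, goals, ask_cost=0.5):
    """Act on the belief-weighted best action unless asking first is worth more."""
    best = max(actions, key=lambda a: expected_value(a, belief, goals))
    value_acting_now = expected_value(best, belief, goals)
    # Value if the true goal were revealed before acting: the best action
    # for each goal, averaged over the current belief.
    value_if_informed = sum(
        p * max(goals[g][a] for a in actions) for g, p in belief.items()
    )
    if value_if_informed - value_acting_now > ask_cost:
        return ("ask", None)
    return ("act", best)

def entropy(belief):
    """Shannon entropy in nats; zero means total certainty."""
    return -sum(p * math.log(p) for p in belief.values() if p > 0)

def has_collapsed(belief, min_entropy=0.05):
    """Flag a belief that has narrowed to near-certainty."""
    return entropy(belief) < min_entropy

# Hypothetical example: what should the agent do with a user's old drafts?
actions = ["delete", "archive", "keep"]
goals = {
    "wants_cleanup": {"delete": 3.0, "archive": 2.0, "keep": 0.0},
    "wants_records": {"delete": -10.0, "archive": 1.0, "keep": 2.0},
}
belief = {"wants_cleanup": 0.6, "wants_records": 0.4}

decision = choose_or_ask(actions, belief, goals)
```

Acting on the single most likely goal here would pick "delete", which is very costly if the other goal is the true one. Averaging over the full belief prefers "archive", and because knowing the goal would change the outcome by more than `ask_cost`, the sketch chooses to ask. Calling `has_collapsed(belief)` after each update gives a simple alarm for the final step.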
