Cooperative Preference Inference
also known as CIRL, Cooperative IRL Agent
Agent and human jointly optimise the human's reward without the agent being told what it is — the interaction is a two-player game in which alignment is learned while acting.
This pattern helps complete certain larger patterns —
- used-by Multi-Principal Welfare Aggregation · When an agent serves multiple humans with conflicting preferences, declare the aggregation rule explicitly rather than letting it be implicit in the prompt or fine-tune.
Context
A long-running personal or organisational agent must serve a human or team whose true preferences shift, are partially observable, and were never written down completely. The agent has access to demonstrations, corrections, partial instructions, and explicit questions, but no closed-form objective function.
Problem
Treating the agent's objective as a fixed, handed-down reward (even one baked in by fine-tuning an LLM) fails whenever actual preferences drift, whenever a novel situation arises that the reward did not anticipate, and whenever the human would have said something different if asked. The agent confidently optimises a frozen proxy that diverges from what the human actually wants. The missing signal is the interaction itself, in which the human shows, tells, and corrects in real time.
Forces
- True preferences are partially observable and shift over time.
- Demonstrations, instructions, and corrections are all evidence about preferences, not commands.
- Asking too often is intrusive; never asking is unsafe.
- The agent must act while learning, not freeze waiting for full specification.
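The tension between asking too often and never asking can be made concrete with a value-of-information check. The sketch below is illustrative only: it assumes a small discrete set of reward hypotheses, a reward table indexed by hypothesis and action, a perfectly informative answer, and a hypothetical `ask_cost` that stands in for how intrusive a question is. None of these names come from the pattern itself.

```python
import numpy as np

def value_of_asking(posterior, rewards):
    """Expected gain from learning the true hypothesis before acting.

    posterior: shape (H,), probabilities over reward hypotheses.
    rewards:   shape (H, A), reward of each action under each hypothesis.
    Assumes the human's answer reveals the true hypothesis exactly.
    """
    posterior = np.asarray(posterior, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # Acting now: take the action with the best posterior-expected reward.
    act_now = (posterior @ rewards).max()
    # Asking first: the agent can pick the best action for each hypothesis.
    act_informed = posterior @ rewards.max(axis=1)
    return float(act_informed - act_now)

def should_ask(posterior, rewards, ask_cost):
    """Ask only when the expected gain exceeds the (assumed) cost of asking."""
    return value_of_asking(posterior, rewards) > ask_cost

# Two hypotheses, two actions; each action is right under one hypothesis.
rewards = [[1.0, 0.0],
           [0.0, 1.0]]
uncertain = should_ask([0.5, 0.5], rewards, ask_cost=0.1)
confident = should_ask([0.95, 0.05], rewards, ask_cost=0.1)
```

Under these assumptions the agent asks while its posterior is spread out and stops asking once one hypothesis dominates, so it keeps acting throughout rather than waiting for a full specification.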
Example
A long-running personal-assistant agent maintains a posterior over the user's preferences about scheduling: meeting density, focus blocks, when to push back on requests. A new request arrives. The agent both acts (proposing a slot consistent with its current best estimate) and updates (asking a clarifying question whose answer would most reduce posterior variance). The user's corrections over weeks reshape the posterior; the agent never assumes its current best estimate is the truth.
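One way to pick the clarifying question in this example is sketched below. It is a minimal illustration under stated assumptions: the preference profiles, the yes/no questions, and the answer probabilities are all hypothetical, and expected posterior entropy is used as a stand-in for the posterior variance mentioned above because it is simpler to compute over discrete hypotheses.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def bayes_update(prior, likelihood):
    """Posterior proportional to prior times likelihood of the observation."""
    post = np.asarray(prior, dtype=float) * np.asarray(likelihood, dtype=float)
    return post / post.sum()

def expected_entropy_after(prior, p_yes):
    """Expected posterior entropy after a yes/no question.

    p_yes[h] is the assumed probability of a 'yes' under hypothesis h.
    """
    prior = np.asarray(prior, dtype=float)
    p_yes = np.asarray(p_yes, dtype=float)
    total = 0.0
    for lik in (p_yes, 1.0 - p_yes):          # the 'yes' branch, then 'no'
        p_answer = float(prior @ lik)
        if p_answer > 0:
            total += p_answer * entropy(bayes_update(prior, lik))
    return total

def best_question(prior, questions):
    """Return the question whose answer leaves the least expected uncertainty."""
    return min(questions, key=lambda q: expected_entropy_after(prior, questions[q]))

# Hypothetical profiles: dense meetings, protected focus blocks, frequent push-back.
prior = np.full(3, 1 / 3)
questions = {
    "protect_mornings": [0.1, 0.9, 0.5],     # answers differ by profile
    "prefers_video_calls": [0.5, 0.5, 0.5],  # answers carry no information
}
question = best_question(prior, questions)
posterior = bayes_update(prior, questions[question])  # suppose the user says yes
```

The resulting posterior is still a distribution rather than a point estimate, which matches the example's requirement that the agent never treats its current best guess as the truth.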
Solution
Therefore:
Model the situation as Cooperative Inverse Reinforcement Learning (CIRL). Human and agent share a single reward function R, but only the human knows it. The agent treats human actions, demonstrations, and explicit corrections as evidence about R, maintains a posterior over R, and acts to maximise expected R under that posterior. Optimal play yields active teaching (the human chooses informative actions) and active learning (the agent asks informative questions). Unlike RLHF, characterised here as one-shot offline preference learning, CIRL is continuous and online.
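The core loop (observe the human, update the posterior over R, act on expected R) can be sketched as follows. This is a simplified illustration, not a full CIRL solver: it assumes a finite set of candidate reward functions, and it models the human as noisily rational, choosing actions with probability proportional to exp(beta × reward). The Boltzmann model and the `beta` parameter are assumptions introduced here, not part of the pattern.

```python
import numpy as np

def choice_likelihood(rewards, chosen, beta=2.0):
    """P(human picks `chosen` | each hypothesis), softmax in reward.

    rewards: shape (H, A), reward of each action under each hypothesis.
    beta:    assumed rationality; higher means the human errs less often.
    """
    logits = beta * np.asarray(rewards, dtype=float)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, chosen]

def update(posterior, rewards, chosen, beta=2.0):
    """Treat the human's action as evidence about R, not as a command."""
    post = np.asarray(posterior, dtype=float) * choice_likelihood(rewards, chosen, beta)
    return post / post.sum()

def best_action(posterior, rewards):
    """Act to maximise expected reward under the current posterior."""
    expected = np.asarray(posterior, dtype=float) @ np.asarray(rewards, dtype=float)
    return int(np.argmax(expected))

# Two candidate rewards, three actions; action 2 is a safe compromise.
rewards = [[1.0, 0.0, 0.6],
           [0.0, 1.0, 0.6]]
prior = np.array([0.5, 0.5])
before = best_action(prior, rewards)          # acts while still uncertain
posterior = update(prior, rewards, chosen=0)  # human demonstrates action 0
after = best_action(posterior, rewards)
```

With these illustrative numbers the agent hedges with the compromise action while uncertain and commits to the human's preferred action once the demonstration shifts the posterior. Because the update can run after every interaction, the same code supports the continuous, online character described above.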
What this pattern forbids. The agent must not treat its reward function as fully known; human behaviour is treated as evidence about a reward the agent only has a posterior over.
The smaller patterns that complete this one —
- uses Preference-Uncertain Agent · Agent treats its own reward/objective as a hidden variable to be inferred from human behaviour, not a fixed target.
And the patterns that stand alongside it, or against it —
- complements Corrigible Off-Switch Incentive · Design the agent so being shut down or overridden by a human carries positive expected value, because the human's intervention is itself evidence the current objective is mis-specified.
- complements Human Reflection ★ Reflection loop that explicitly collects human feedback (not approval) on agent plans to improve them, distinct from approval gates where the human only says yes/no.
- complements Soft-Optimization Cap · Cap how strongly the agent optimises its inferred objective: sample from the top quantile of acceptable actions rather than the argmax, or stop improving once the objective is good enough.