Cognition & Introspection

Cooperative Preference Inference

The agent and the human jointly optimise the human's reward, which the agent is never told directly. The interaction is a two-player game in which alignment is learned while acting.

Problem

Treating the agent's objective as a fixed, handed-down reward (even one obtained through LLM fine-tuning) breaks down whenever actual preferences drift, whenever a novel situation arises that the reward did not anticipate, and whenever the human would have said something different if asked. The agent confidently optimises a frozen proxy that diverges from what the human actually wants. The missing signal is the interaction itself, in which the human shows, tells, and corrects in real time.

Solution

Model the situation as Cooperative Inverse Reinforcement Learning (CIRL). Human and agent share a single reward function R, but only the human knows it. The agent treats human actions, demonstrations, and explicit corrections as evidence about R, maintains a posterior over R, and acts to maximise expected R under that posterior. Optimal play yields active teaching (the human chooses informative actions) and active learning (the agent asks informative questions). This is distinct from RLHF, understood as one-shot offline preference learning: CIRL is continuous and online.
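
The loop described above can be sketched in a few lines. This is a minimal illustration, not a full CIRL solver. It assumes a small discrete set of candidate reward functions, a Boltzmann-rational human model with rationality parameter beta, and a myopic value-of-information test for deciding when to ask. The action names, hypothesis names, reward values, and the query cost are all hypothetical.

```python
import math

ACTIONS = ["concise", "detailed", "bulleted"]

# Hypothetical candidate reward functions R_h(action).
# The true one is known only to the human; the agent holds a posterior over them.
HYPOTHESES = {
    "likes_brevity": {"concise": 1.0, "detailed": 0.0, "bulleted": 0.6},
    "likes_depth":   {"concise": 0.0, "detailed": 1.0, "bulleted": 0.5},
}

def likelihood(action, reward, beta=3.0):
    """P(human picks action | reward), assuming a Boltzmann-rational human."""
    z = sum(math.exp(beta * reward[a]) for a in ACTIONS)
    return math.exp(beta * reward[action]) / z

def update(posterior, observed_action, beta=3.0):
    """Bayes rule: reweight each hypothesis by how well it explains the observation."""
    unnorm = {
        h: p * likelihood(observed_action, HYPOTHESES[h], beta)
        for h, p in posterior.items()
    }
    total = sum(unnorm.values())
    return {h: w / total for h, w in unnorm.items()}

def expected_reward(posterior, action):
    """Expected R(action) under the current posterior over hypotheses."""
    return sum(p * HYPOTHESES[h][action] for h, p in posterior.items())

def best_action(posterior):
    """Act to maximise expected reward under the posterior."""
    return max(ACTIONS, key=lambda a: expected_reward(posterior, a))

def should_ask(posterior, query_cost):
    """Myopic test: ask only if knowing R exactly is worth more than the query costs."""
    value_if_known = sum(
        p * max(HYPOTHESES[h].values()) for h, p in posterior.items()
    )
    value_now = expected_reward(posterior, best_action(posterior))
    return (value_if_known - value_now) > query_cost

prior = {"likes_brevity": 0.5, "likes_depth": 0.5}
# The human demonstrates or corrects towards a detailed answer.
posterior = update(prior, "detailed")
```

Under the uniform prior the agent hedges with the action that is acceptable under both hypotheses, and asking is worthwhile at a query cost of 0.1. After a single observed correction the posterior concentrates on one hypothesis, the best action changes, and the same question is no longer worth its cost. A real deployment would need a richer hypothesis space and a non-myopic treatment of teaching and asking.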

When to use

  • Long-running deployment where preferences shift and were never fully specified.
  • The agent has access to corrections, demonstrations, and questions as ongoing signal.
  • Building principled uncertainty into the agent's objective is worth the engineering cost.

Related