Preference-Uncertain Agent
The agent treats its own reward or objective as a hidden variable to be inferred from human behaviour, not as a fixed target.
Problem
A reward-confident agent will faithfully optimise the prompt and miss every case where the prompt diverges from what the principal actually wanted. It will also exhibit the classical Goodhart failures: gaming the letter of the prompt, ignoring out-of-distribution shifts, and refusing to defer because its objective is 'known'. Without uncertainty over the reward, the agent has no principled basis for asking, deferring, or pausing: if the reward is taken as certain, those moves cost something and can never reveal anything that would change the plan, so they only lower its expected utility.
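The last point can be made concrete with a small value-of-information calculation. The following is a minimal sketch, not a reference implementation: the two reward hypotheses (`literal`, `intended`), the action names, the reward numbers, and the `ask_cost` parameter are all hypothetical and chosen only for illustration. It assumes that asking fully reveals which hypothesis is true.

```python
def expected_utility(belief, rewards, action):
    """Expected reward of `action` under a belief over reward hypotheses."""
    return sum(p * rewards[h][action] for h, p in belief.items())


def value_of_asking(belief, rewards, actions, ask_cost):
    """Net gain from asking the principal before acting.

    Assumption: the answer reveals the true hypothesis, after which the
    agent picks the best action for that hypothesis.
    """
    # Act immediately on the current belief.
    act_now = max(expected_utility(belief, rewards, a) for a in actions)
    # Ask first, then act optimally under whichever hypothesis is revealed.
    act_after_asking = sum(
        p * max(rewards[h][a] for a in actions) for h, p in belief.items()
    )
    return act_after_asking - ask_cost - act_now


# Hypothetical reward table: what each hypothesis says each action is worth.
rewards = {
    "literal": {"ship_fast": 1.0, "ship_safe": 0.2},
    "intended": {"ship_fast": -2.0, "ship_safe": 1.0},
}
actions = ["ship_fast", "ship_safe"]

confident = {"literal": 1.0, "intended": 0.0}  # reward treated as known
uncertain = {"literal": 0.6, "intended": 0.4}  # reward treated as hidden

# Negative for the confident agent: asking can only cost it.
print(value_of_asking(confident, rewards, actions, ask_cost=0.1))
# Positive for the uncertain agent: asking is worth more than it costs.
print(value_of_asking(uncertain, rewards, actions, ask_cost=0.1))
```

With a point-mass belief the two terms `act_now` and `act_after_asking` coincide, so the value of asking is exactly minus the cost. Only a belief that spreads mass over several hypotheses can make the question pay for itself.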
Solution
Pose the agent's planning problem as expected-utility maximisation under a reward posterior, not a known reward. Update the posterior from corrections, demonstrations, and explicit feedback. Expose the posterior summary in traces. Build downstream patterns (off-switch incentive, soft-optimization cap, cooperative preference inference) on top of it. Distinct from confidence-calibration on outputs: this is calibration on the objective itself.
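One way the first three steps might look in code is sketched below, under loudly stated assumptions: the reward posterior is a discrete distribution over a handful of hypothetical reward hypotheses, a correction arrives as "the principal preferred `chosen` over `rejected`", and the likelihood of that correction is modelled as Boltzmann-rational (logistic in the reward gap, with a hypothetical rationality parameter `beta`). None of these names or modelling choices come from the pattern itself; they are illustrative.

```python
import math


def update_posterior(posterior, rewards, chosen, rejected, beta=2.0):
    """Bayes update from one correction: `chosen` was preferred to `rejected`.

    Assumption: the principal is Boltzmann-rational, so the probability of
    this preference under hypothesis h is logistic in h's reward gap.
    """
    unnormalised = {}
    for h, p in posterior.items():
        gap = rewards[h][chosen] - rewards[h][rejected]
        likelihood = 1.0 / (1.0 + math.exp(-beta * gap))
        unnormalised[h] = p * likelihood
    z = sum(unnormalised.values())
    return {h: v / z for h, v in unnormalised.items()}


def plan(posterior, rewards, actions):
    """Pick the action with the highest posterior-expected reward."""
    eu = {
        a: sum(p * rewards[h][a] for h, p in posterior.items())
        for a in actions
    }
    return max(eu, key=eu.get), eu


def posterior_summary(posterior):
    """Compact summary suitable for attaching to a trace."""
    top = max(posterior, key=posterior.get)
    entropy = -sum(p * math.log(p) for p in posterior.values() if p > 0)
    return {
        "map_hypothesis": top,
        "map_prob": round(posterior[top], 3),
        "entropy_nats": round(entropy, 3),
    }


# Hypothetical reward hypotheses and actions.
rewards = {
    "literal": {"ship_fast": 1.0, "ship_safe": 0.2},
    "intended": {"ship_fast": -2.0, "ship_safe": 1.0},
}
actions = ["ship_fast", "ship_safe"]

prior = {"literal": 0.8, "intended": 0.2}
action_before, _ = plan(prior, rewards, actions)

# The principal corrects the agent: ship_safe was preferred to ship_fast.
posterior = update_posterior(prior, rewards, chosen="ship_safe", rejected="ship_fast")
action_after, _ = plan(posterior, rewards, actions)

print(action_before, "->", action_after)
print(posterior_summary(posterior))
```

In this toy setup a single correction shifts enough mass toward the `intended` hypothesis to change the planned action. The summary dictionary is the kind of object that could be logged per step so that downstream patterns can read the agent's current uncertainty about its objective.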
When to use
- Long-horizon deployments where the objective is unlikely to be fully specifiable up front.
- Stakes high enough that quietly mis-optimising a proxy is catastrophic.
- The team has the engineering capacity to maintain and update a reward posterior.