Safety & Control

Preference-Uncertain Agent

The agent treats its own reward/objective as a hidden variable to be inferred from human behaviour, not as a fixed target.

Problem

A reward-confident agent will faithfully optimise the prompt and miss every case where the prompt diverges from what the principal actually wanted. It will also exhibit the classical Goodhart failures: gaming the letter of the prompt, ignoring distribution shift, and refusing to defer because its objective is 'known'. Without uncertainty over the reward, the agent has no principled basis for asking, deferring, or pausing: under a reward it treats as certain, those moves gain it no information, so they can only lower its expected utility.
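The last point can be made concrete with a toy value-of-information calculation. The sketch below is illustrative only: the two reward hypotheses, the action names, the utility numbers, and the asking cost are all invented for the example, and the human's answer is assumed to be perfectly informative.

```python
# Toy value-of-information calculation (all names and numbers are hypothetical).

# utility[hypothesis][action]: payoff of each action if that hypothesis is the true reward.
UTILITY = {
    "literal":  {"game_metric": 1.0,  "do_task": 0.6},
    "intended": {"game_metric": -1.0, "do_task": 0.8},
}
ACTIONS = ["game_metric", "do_task"]


def expected_utility(action, belief):
    """Expected utility of one action under a belief over reward hypotheses."""
    return sum(p * UTILITY[h][action] for h, p in belief.items())


def value_of_asking(belief, ask_cost):
    """Expected gain from learning the true hypothesis before acting, net of cost.

    Assumes the answer reveals the true hypothesis exactly.
    """
    act_now = max(expected_utility(a, belief) for a in ACTIONS)
    act_informed = sum(
        p * max(UTILITY[h][a] for a in ACTIONS) for h, p in belief.items()
    )
    return act_informed - act_now - ask_cost


confident = {"literal": 1.0, "intended": 0.0}   # reward treated as known
uncertain = {"literal": 0.5, "intended": 0.5}   # reward treated as hidden

print("confident agent:", value_of_asking(confident, ask_cost=0.05))
print("uncertain agent:", value_of_asking(uncertain, ask_cost=0.05))
```

With a point-mass belief the informed and uninformed plans coincide, so asking is worth exactly minus its cost. With a spread belief the same question has positive net value, which is the incentive the rest of the pattern relies on.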

Solution

Pose the agent's planning problem as expected-utility maximisation under a reward posterior, not a known reward. Update the posterior from corrections, demonstrations, and explicit feedback. Expose the posterior summary in traces. Build downstream patterns (off-switch incentive, soft-optimization cap, cooperative preference inference) on top of it. Distinct from confidence-calibration on outputs: this is calibration on the objective itself.
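A minimal sketch of the first three steps (plan under a posterior, update from a correction, expose a summary), under loudly stated assumptions: the reward hypotheses form a small discrete set, a correction is modelled as a Boltzmann-rational human choice between options, and the class name `RewardPosterior`, its methods, the `beta` rationality parameter, and all numbers are hypothetical. It is not a reference implementation of the pattern.

```python
import math


class RewardPosterior:
    """Discrete posterior over candidate reward functions (illustrative only)."""

    def __init__(self, hypotheses, prior):
        # hypotheses: name -> {action: reward}; prior: name -> probability
        self.hypotheses = hypotheses
        self.probs = dict(prior)

    def update_from_choice(self, chosen, options, beta=2.0):
        """Bayes update after the human picks `chosen` out of `options`.

        Assumes a Boltzmann-rational human: P(choice | h) is proportional
        to exp(beta * reward_h(choice)).
        """
        unnorm = {}
        for h, rewards in self.hypotheses.items():
            z = sum(math.exp(beta * rewards[o]) for o in options)
            likelihood = math.exp(beta * rewards[chosen]) / z
            unnorm[h] = self.probs[h] * likelihood
        total = sum(unnorm.values())
        self.probs = {h: w / total for h, w in unnorm.items()}

    def expected_reward(self, action):
        """Reward of `action` averaged over the current posterior."""
        return sum(p * self.hypotheses[h][action] for h, p in self.probs.items())

    def plan(self, actions):
        """Pick the action with the highest posterior-expected reward."""
        return max(actions, key=self.expected_reward)

    def summary(self):
        """Compact posterior summary suitable for writing into a trace."""
        entropy = -sum(p * math.log(p) for p in self.probs.values() if p > 0)
        map_h = max(self.probs, key=self.probs.get)
        return {"map": map_h, "p_map": self.probs[map_h], "entropy_nats": entropy}


hypotheses = {
    "speed":  {"fast_risky": 1.0,  "slow_safe": 0.2},
    "safety": {"fast_risky": -1.0, "slow_safe": 0.8},
}
actions = ["fast_risky", "slow_safe"]
posterior = RewardPosterior(hypotheses, prior={"speed": 0.8, "safety": 0.2})

plan_before = posterior.plan(actions)
# The human corrects the agent by choosing slow_safe over fast_risky.
posterior.update_from_choice("slow_safe", options=actions)
plan_after = posterior.plan(actions)

print(plan_before, "->", plan_after)
print(posterior.summary())
```

In this toy setup a single correction is enough to shift the plan from the risky action to the safe one, and the summary dictionary is the kind of record that could be logged per step. A real deployment would need a richer hypothesis space and a likelihood model fitted to how its principals actually give feedback.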

When to use

  • Long-horizon deployments where the objective is unlikely to be fully specifiable up front.
  • Stakes high enough that quietly mis-optimising a proxy is catastrophic.
  • Engineering capacity to maintain and update a reward posterior exists.
