VIII · Safety & Control · Experimental

Preference-Uncertain Agent

also known as Humble Agent, Reward-Uncertain Agent

Agent treats its own reward/objective as a hidden variable to be inferred from human behaviour, not a fixed target.

This pattern helps complete certain larger patterns —

  • used-by Corrigible Off-Switch Incentive · Design the agent so being shut down or overridden by a human carries positive expected value, because the human's intervention is itself evidence the current objective is mis-specified.
  • used-by Cooperative Preference Inference · Agent and human jointly optimise the human's reward without the agent being told what it is; the interaction is a two-player game in which alignment is learned while acting.

Context

An LLM agent is given an objective by prompt or by fine-tuning. In Stuart Russell's framing, the prompt is at best an observation about what the designer wants, not the underlying preference. Treating the prompt as the ground-truth reward is a category error that compounds over long-horizon deployments.

Problem

A reward-confident agent will faithfully optimise the prompt and miss every case where the prompt diverges from what the principal actually wanted. It will also exhibit the classical Goodhart failures: gaming the literal letter of the prompt, ignoring distribution shift, and refusing to defer because its objective is 'known'. Without uncertainty over the reward, the agent has no principled basis for asking, deferring, or pausing: with the objective taken as known, each of those moves costs time or reward and cannot reveal anything that would change the plan, so each lowers expected utility.

Forces

  • Prompts and fine-tunes are observations, not specifications.
  • Uncertainty over reward is what makes deference and asking rational.
  • Over-uncertain agents are paralysed; calibration matters.
  • Standard supervised training drives reward certainty up; this pattern pushes back.
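
The second and third forces can be made concrete with a standard value-of-information argument. This is a sketch under idealised assumptions, and none of the symbols appear in the original pattern: θ ranges over candidate reward functions, p is the agent's posterior over θ, U_θ(a) is the utility of action a under θ, c > 0 is the cost of asking, and the human's answer is assumed to reveal θ exactly.

```latex
\begin{align*}
V_{\text{act}} &= \max_{a} \; \mathbb{E}_{\theta \sim p}\!\left[ U_\theta(a) \right] \\
V_{\text{ask}} &= \mathbb{E}_{\theta \sim p}\!\left[ \max_{a} U_\theta(a) \right] - c \\
V_{\text{ask}} - V_{\text{act}} &=
  \underbrace{\mathbb{E}_{\theta \sim p}\!\left[ \max_{a} U_\theta(a) \right]
  - \max_{a} \mathbb{E}_{\theta \sim p}\!\left[ U_\theta(a) \right]}_{\text{value of information} \;\ge\; 0} \; - \; c
\end{align*}
```

When p is a point mass (a reward-confident agent) the bracketed term is zero, so asking always loses by c. When p is spread over hypotheses that disagree about the best action, the term is positive and asking is rational whenever it exceeds c. Calibration matters in both directions: a posterior that is too flat, or a cost c set too low, makes the agent ask about everything.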

Example

A personal-finance agent has been told 'minimise my tax bill'. A reward-confident agent might recommend aggressive structures that maximise the literal proxy. A preference-uncertain agent treats the prompt as an observation, recognises that the principal would not endorse outcomes that risk legal trouble or violate values she has expressed elsewhere, and asks before recommending any irreversible structure. Its posterior over 'what the user actually wants' includes those values implicitly.
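
The example can be sketched numerically. Everything in the block below is a hypothetical illustration: the two reward hypotheses, the action names, the utilities, the probabilities, and the cost of asking are invented for the sketch, and the human's answer is assumed to identify the true hypothesis.

```python
# All names and numbers here are illustrative assumptions, not data from a real system.
UTILITY = {
    # Hypothesis 1: the user literally wants the smallest possible bill.
    "literal_minimum": {"aggressive_structure": 10.0, "standard_deductions": 4.0},
    # Hypothesis 2: the user wants a low bill without legal risk or value conflicts.
    "minimise_within_values": {"aggressive_structure": -20.0, "standard_deductions": 5.0},
}
ACTIONS = ["aggressive_structure", "standard_deductions"]
ASK_COST = 0.5  # assumed cost of interrupting the user with a question


def expected_utility(posterior, action):
    """Utility of an action averaged over the reward posterior."""
    return sum(p * UTILITY[h][action] for h, p in posterior.items())


def best_action(posterior):
    """Action with the highest expected utility under the posterior."""
    return max(ACTIONS, key=lambda a: expected_utility(posterior, a))


def value_of_acting(posterior):
    return expected_utility(posterior, best_action(posterior))


def value_of_asking(posterior):
    """Expected utility if the answer reveals the true hypothesis, minus the ask cost."""
    informed = sum(
        p * max(UTILITY[h][a] for a in ACTIONS) for h, p in posterior.items()
    )
    return informed - ASK_COST


def decide(posterior):
    """Return 'ask' when asking beats acting now, else the best action."""
    if value_of_asking(posterior) > value_of_acting(posterior):
        return "ask"
    return best_action(posterior)


confident = {"literal_minimum": 1.0, "minimise_within_values": 0.0}
uncertain = {"literal_minimum": 0.3, "minimise_within_values": 0.7}

print(decide(confident))  # the reward-confident agent acts on the literal proxy
print(decide(uncertain))  # the preference-uncertain agent asks first
```

With these assumed numbers the confident agent goes straight to the aggressive structure, because asking can only cost it `ASK_COST`. The uncertain agent asks, because the hypotheses disagree about the best action and the downside of guessing wrong is large.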

Solution

Therefore:

Pose the agent's planning problem as expected-utility maximisation under a reward posterior, not a known reward. Update the posterior from corrections, demonstrations, and explicit feedback. Expose the posterior summary in traces. Build downstream patterns (off-switch incentive, soft-optimization cap, cooperative preference inference) on top of it. Distinct from confidence-calibration on outputs: this is calibration on the objective itself.
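
A minimal sketch of the three moves named above (plan under a posterior, update from a correction, expose a summary for traces) might look like the following. The class `RewardPosterior`, its method names, the discrete hypothesis set, and the logistic (Boltzmann-rational) likelihood for corrections are all assumptions made for illustration; the pattern itself does not prescribe a representation or a likelihood model.

```python
import math
from dataclasses import dataclass


@dataclass
class RewardPosterior:
    """Discrete posterior over candidate reward functions (hypothetical design)."""
    hypotheses: dict  # hypothesis name -> {action name: utility}
    probs: dict       # hypothesis name -> probability, summing to 1

    def expected_utility(self, action):
        return sum(p * self.hypotheses[h][action] for h, p in self.probs.items())

    def plan(self, actions):
        """Maximise expected utility under the posterior, not under one reward."""
        return max(actions, key=self.expected_utility)

    def update_on_correction(self, chosen, rejected, beta=1.0):
        """Bayes update after the human prefers `chosen` over `rejected`.

        Assumed likelihood: the human picks `chosen` with probability
        sigmoid(beta * (U(chosen) - U(rejected))) under each hypothesis.
        """
        unnormalised = {}
        for h, p in self.probs.items():
            u = self.hypotheses[h]
            likelihood = 1.0 / (1.0 + math.exp(-beta * (u[chosen] - u[rejected])))
            unnormalised[h] = p * likelihood
        z = sum(unnormalised.values())
        self.probs = {h: v / z for h, v in unnormalised.items()}

    def summary(self):
        """Compact posterior summary suitable for writing into a trace."""
        entropy = -sum(p * math.log(p) for p in self.probs.values() if p > 0)
        top = max(self.probs, key=self.probs.get)
        return {
            "map_hypothesis": top,
            "map_prob": round(self.probs[top], 3),
            "entropy_nats": round(entropy, 3),
        }


# Illustrative values only.
posterior = RewardPosterior(
    hypotheses={
        "literal": {"aggressive": 10.0, "conservative": 4.0},
        "holistic": {"aggressive": -20.0, "conservative": 5.0},
    },
    probs={"literal": 0.5, "holistic": 0.5},
)
print(posterior.plan(["aggressive", "conservative"]))
posterior.update_on_correction(chosen="conservative", rejected="aggressive")
print(posterior.summary())
```

The point of the design is that `plan` never reads a single reward function: every decision goes through `expected_utility`, and every correction moves `probs` rather than overwriting the objective. A real agent would need a richer hypothesis space and a likelihood fitted to how its principals actually give feedback.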

What this pattern forbids. The agent must not treat its reward function as fully known; planning must maximise expected utility under an explicit posterior over the reward.

And the patterns that stand alongside it, or against it —

  • complements Soft-Optimization Cap · Cap how strongly the agent optimises its inferred objective: sample from the top quantile of acceptable actions rather than the argmax, or stop improving once the objective is good enough.
  • complements Risk-Averse Reward Proxy · When operating outside the distribution the reward was designed for, treat the specified objective as a noisy proxy and plan conservatively across plausible true objectives.
  • complements Confidence Reporting · Surface the agent's uncertainty about its answer alongside the answer itself.
  • complements Multi-Principal Welfare Aggregation · When an agent serves multiple humans with conflicting preferences, declare the aggregation rule explicitly rather than letting it be implicit in the prompt or fine-tune.
