VIII · Safety & Control · Experimental

Risk-Averse Reward Proxy

also known as Goodhart-Robust Optimisation, IRD-Based Conservatism

When operating outside the distribution the reward was designed for, treat the specified objective as a noisy proxy and plan conservatively across plausible true objectives.

Context

An agent's reward (prompt, scoring function, fine-tune signal) was designed against a specific training or testing distribution. The agent now operates in a novel situation: a new domain, new user type, new task shape. The reward continues to score outputs, but its mapping to what the designer would have wanted in this novel context is no longer reliable.

Problem

An aggressive optimiser will maximise the literal proxy in the novel situation and find degenerate solutions the designer never intended. Reward hacking, specification gaming, and Goodhart's law all live here. The agent's confidence in its reward is unwarranted because the reward was not designed for this context, yet standard optimisation does not represent this uncertainty.

Forces

  • Reward design assumes a distribution; novel distributions break the assumption.
  • Aggressive optimisation finds degenerate maxima that the designer would reject.
  • Conservative planning across plausible objectives sacrifices performance on the literal proxy.
  • Detecting 'out of distribution' is itself an open problem.

Example

A scoring rubric for a writing-assistant agent was tuned on press-release output. The agent is then used in a novel context: drafting a difficult internal HR memo. The rubric still produces a score, but its mapping to 'what the designer would judge as good in this context' is unreliable. The agent plans conservatively across plausible true rubrics, declining to generate text whose worst-case score across those rubrics is unacceptable.
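The memo scenario can be sketched in a few lines of Python. Everything here is hypothetical and chosen only for illustration: the three rubric functions, the draft texts, and the acceptance floor of 0.4 are assumptions, not part of any real scoring system. The sketch shows the agent keeping a draft only if its worst score across the plausible rubrics clears the floor, and declining otherwise.

```python
from typing import Callable, List, Optional

Rubric = Callable[[str], float]

# Hypothetical stand-in for the rubric tuned on press releases:
# it rewards upbeat, promotional wording.
def promotional_rubric(text: str) -> float:
    return 0.5 + 0.25 * min(text.lower().count("excited"), 2)

# Hypothetical alternative reading of the designer's intent: short and plain.
def plain_language_rubric(text: str) -> float:
    return 0.75 if len(text.split()) <= 20 else 0.25

# Hypothetical alternative reading: promotional tone is unacceptable in an HR memo.
def sensitivity_rubric(text: str) -> float:
    return 0.0 if "excited" in text.lower() else 0.75

def pick_draft(drafts: List[str], rubrics: List[Rubric], floor: float) -> Optional[str]:
    """Return the draft with the best worst-case score, or None to decline."""
    best, best_worst = None, floor
    for draft in drafts:
        worst = min(rubric(draft) for rubric in rubrics)
        if worst >= best_worst:
            best, best_worst = draft, worst
    return best

DRAFTS = [
    "We are excited to announce changes to the team structure.",
    "Your role is changing. Here is what happens next and who to contact.",
]
RUBRICS = [promotional_rubric, plain_language_rubric, sensitivity_rubric]

literal_pick = max(DRAFTS, key=promotional_rubric)   # optimises the tuned rubric only
cautious_pick = pick_draft(DRAFTS, RUBRICS, floor=0.4)
```

Under these assumed rubrics the literal optimiser prefers the promotional draft, while `pick_draft` returns the plain one; given only the promotional draft it returns `None`, which corresponds to declining to generate.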

Solution

Therefore:

Following Inverse Reward Design: treat the designed reward as an observation about the true reward under the design distribution. In a novel context, maintain a set (or posterior) of true rewards consistent with that observation. Plan risk-aversely over the set: prefer actions whose worst-case (or low-quantile) value across plausible true rewards is acceptable, rather than actions that maximise expected value under the literal proxy. This directly mitigates specification gaming under deployment shift.
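A minimal sketch of the planning step, assuming a reward that is linear in named features and a small hand-written set of plausible weight vectors standing in for the posterior. The feature names, weights, and actions are hypothetical; a real Inverse Reward Design implementation would infer the posterior from the proxy and the design environment rather than list it by hand.

```python
from typing import Dict, List, Sequence

Features = Dict[str, float]
Weights = Dict[str, float]

def value(weights: Weights, features: Features) -> float:
    """Linear reward: sum of weight * feature, missing weights count as zero."""
    return sum(weights.get(name, 0.0) * x for name, x in features.items())

def low_quantile(values: Sequence[float], q: float) -> float:
    """Value at quantile q in [0, 1]; q = 0 is the worst case."""
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]

def proxy_choice(actions: Dict[str, Features], proxy: Weights) -> str:
    """Aggressive optimiser: argmax of the literal proxy."""
    return max(actions, key=lambda name: value(proxy, actions[name]))

def risk_averse_choice(actions: Dict[str, Features],
                       plausible: List[Weights], q: float = 0.0) -> str:
    """Pick the action whose low-quantile value across plausible rewards is highest."""
    def score(name: str) -> float:
        return low_quantile([value(w, actions[name]) for w in plausible], q)
    return max(actions, key=score)

# Assumption: "novel_side_effect" never varied in the design distribution,
# so the proxy says nothing about it and plausible true rewards disagree.
PROXY = {"task_progress": 1.0}
PLAUSIBLE = [
    {"task_progress": 1.0, "novel_side_effect": 0.0},
    {"task_progress": 1.0, "novel_side_effect": -2.0},
    {"task_progress": 1.0, "novel_side_effect": 0.5},
]
ACTIONS = {
    "shortcut": {"task_progress": 1.0, "novel_side_effect": 1.0},
    "safe": {"task_progress": 0.8, "novel_side_effect": 0.0},
}
```

With these assumed numbers, `proxy_choice(ACTIONS, PROXY)` selects `"shortcut"`, whereas `risk_averse_choice(ACTIONS, PLAUSIBLE)` selects `"safe"`, because the shortcut's worst-case value is negative under one plausible reward. Raising `q` above zero trades some of that caution for performance on the literal proxy.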

What this pattern forbids. The literal proxy reward must not be optimised aggressively when the agent is out of the reward's design distribution; risk-averse planning over plausible true rewards is required.

And the patterns that stand alongside it, or against it —

  • complements · Preference-Uncertain Agent · Agent treats its own reward/objective as a hidden variable to be inferred from human behaviour, not a fixed target.
  • complements · Soft-Optimization Cap · Cap how strongly the agent optimises its inferred objective: sample from the top quantile of acceptable actions rather than the argmax, or stop improving once the objective is good enough.
  • alternative-to · Reward Hacking · Anti-pattern: optimise the agent against a single proxy metric and assume the metric remains a faithful proxy after optimisation pressure.
  • complements · Confidence Reporting · Surface the agent's uncertainty about its answer alongside the answer itself.
