Risk-Averse Reward Proxy

also known as Goodhart-Robust Optimisation, IRD-Based Conservatism

When operating outside the distribution the reward was designed for, treat the specified objective as a noisy proxy and plan conservatively across plausible true objectives.

Context

An agent's reward (prompt, scoring function, fine-tune signal) was designed against a specific training or testing distribution. The agent now operates in a novel situation: a new domain, new user type, new task shape. The reward continues to score outputs, but its mapping to what the designer would have wanted in this novel context is no longer reliable.

Problem

An aggressive optimiser will maximise the literal proxy in the novel situation and find degenerate solutions the designer never intended. Reward hacking, specification gaming, and Goodhart's law all live here. The agent's confidence in its reward is unwarranted because the reward was not designed for this context, yet standard optimisation does not represent this uncertainty.

Forces

Reward design assumes a distribution; novel distributions break the assumption.
Aggressive optimisation finds degenerate maxima that the designer would reject.
Conservative planning across plausible objectives sacrifices performance on the literal proxy.
Detecting 'out of distribution' is itself an open problem.

Example

A scoring rubric for a writing-assistant agent was tuned on press-release output. The agent is then used on a novel context — drafting a difficult internal HR memo. The reward score still fires, but its mapping to 'what the designer would judge as good in this context' is unreliable. The agent plans conservatively across plausible true rubrics, declining to generate text whose worst-case interpretation across plausible rubrics is unacceptable.
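The decline rule in this example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the catalogue's implementation: the two rubric functions, the keyword list, the sample drafts, and the `floor` threshold are all hypothetical stand-ins for whatever plausible rubrics a real system would maintain.

```python
from typing import Callable, Optional, Sequence

# Hypothetical rubric type: maps a draft to a score in [0, 1].
Rubric = Callable[[str], float]

UPBEAT = {"exciting", "thrilled", "proud"}  # illustrative keyword list


def upbeat_hits(draft: str) -> int:
    """Count upbeat keywords, ignoring case and trailing punctuation."""
    return sum(w.strip(".,!") in UPBEAT for w in draft.lower().split())


def press_release_rubric(draft: str) -> float:
    """Stand-in for the tuned proxy: rewards upbeat language."""
    return min(1.0, 0.4 + 0.3 * upbeat_hits(draft))


def sober_memo_rubric(draft: str) -> float:
    """Stand-in for a plausible true rubric in the HR-memo context."""
    return max(0.0, 0.8 - 0.4 * upbeat_hits(draft))


def choose_or_decline(
    drafts: Sequence[str], rubrics: Sequence[Rubric], floor: float
) -> Optional[str]:
    """Return the draft with the best worst-case score, or None to decline."""
    best_draft, best_worst = None, float("-inf")
    for draft in drafts:
        worst = min(rubric(draft) for rubric in rubrics)
        if worst > best_worst:
            best_draft, best_worst = draft, worst
    return best_draft if best_worst >= floor else None


drafts = [
    "We are thrilled and proud to share exciting changes.",
    "The team structure will change on 1 March. Details follow.",
]
rubrics = [press_release_rubric, sober_memo_rubric]

print(max(drafts, key=press_release_rubric))     # literal proxy picks the upbeat draft
print(choose_or_decline(drafts, rubrics, 0.3))   # worst-case rule picks the plain draft
print(choose_or_decline(drafts, rubrics, 0.5))   # None: no draft clears the floor
```

The literal proxy prefers the upbeat draft, while the worst-case rule prefers the plain one and declines outright when the acceptability floor is raised above what any draft can guarantee.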

Diagram

flowchart LR
  Ctx[Context: in or out of design dist?] --> OOD{OOD?}
  OOD -- no --> Norm[Optimise proxy]
  OOD -- yes --> Set[Plausible true reward set]
  Set --> Plan[Risk-averse planning over set]
  Plan --> Act
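The control flow in the diagram might look like the following Python sketch. It assumes a single scalar context feature and a z-score test as the out-of-distribution check, which is a deliberately crude placeholder (detecting OOD is an open problem, as noted under Forces). The names `is_out_of_distribution`, `select_action`, the lookup tables, and the threshold are all hypothetical.

```python
import statistics
from typing import Callable, Sequence


def is_out_of_distribution(
    x: float, design_samples: Sequence[float], z_threshold: float = 3.0
) -> bool:
    """Crude placeholder OOD test: flag x if it lies far from the design samples."""
    mean = statistics.fmean(design_samples)
    std = statistics.stdev(design_samples)
    return abs(x - mean) > z_threshold * std


def select_action(
    actions: Sequence[str],
    proxy: Callable[[str], float],
    plausible_rewards: Sequence[Callable[[str], float]],
    ood: bool,
) -> str:
    """In distribution: optimise the proxy. Out of distribution: maximise the worst case."""
    if not ood:
        return max(actions, key=proxy)
    return max(actions, key=lambda a: min(r(a) for r in plausible_rewards))


# Hypothetical context feature values observed when the reward was designed.
design_samples = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]

# Hypothetical reward tables: the proxy, and one pessimistic alternative.
proxy_table = {"bold": 1.0, "cautious": 0.6}
pessimistic_table = {"bold": -1.0, "cautious": 0.5}
actions = list(proxy_table)
plausible = [proxy_table.get, pessimistic_table.get]

for context in (1.05, 2.0):
    ood = is_out_of_distribution(context, design_samples)
    print(context, ood, select_action(actions, proxy_table.get, plausible, ood))
```

With these toy numbers the familiar context is routed to plain proxy optimisation and the unfamiliar one to the worst-case rule, so the chosen action switches from "bold" to "cautious".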

Solution

Therefore:

Following Inverse Reward Design: treat the designed reward as an observation about the true reward under the design distribution. In a novel context, maintain a set (or posterior) of true rewards consistent with that observation. Plan risk-aversely over the set: prefer actions whose worst-case (or low-quantile) value across plausible true rewards is acceptable, rather than actions that maximise expected value under the literal proxy. This directly mitigates specification gaming under deployment shift.
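A rough Python sketch of these two steps follows. It is a set-based simplification, not the Bayesian posterior used in Inverse Reward Design: a candidate reward counts as "consistent" here if it selects the same action as the proxy in the design environment. Linear rewards over feature vectors, the feature layout, the candidate weights, and the function names are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Weights = Tuple[float, ...]
Features = Tuple[float, ...]


def dot(w: Sequence[float], phi: Sequence[float]) -> float:
    return sum(wi * pi for wi, pi in zip(w, phi))


def best_action(w: Weights, actions: Dict[str, Features]) -> str:
    return max(actions, key=lambda a: dot(w, actions[a]))


def consistent_rewards(
    candidates: Sequence[Weights], proxy_w: Weights, design_actions: Dict[str, Features]
) -> List[Weights]:
    """Keep candidates that agree with the proxy's choice in the design environment."""
    proxy_choice = best_action(proxy_w, design_actions)
    return [w for w in candidates if best_action(w, design_actions) == proxy_choice]


def risk_averse_action(
    actions: Dict[str, Features], reward_set: Sequence[Weights], quantile: float = 0.0
) -> str:
    """Pick the action with the best low-quantile value (quantile=0.0 is worst case)."""
    def low_value(a: str) -> float:
        values = sorted(dot(w, actions[a]) for w in reward_set)
        return values[int(quantile * (len(values) - 1))]
    return max(actions, key=low_value)


# Hypothetical features: (progress, cost, novel_feature).
# The third feature is always 0 in the design environment, so the
# observation says nothing about its true weight.
proxy_w = (1.0, -1.0, 0.0)
design_actions = {"short_path": (1.0, 0.2, 0.0), "long_path": (0.6, 0.8, 0.0)}
candidates = [(1.0, -1.0, -2.0), (1.0, -1.0, 0.0), (1.0, -1.0, 2.0), (-1.0, 1.0, 0.0)]

reward_set = consistent_rewards(candidates, proxy_w, design_actions)

# Novel context: one action exercises the feature the proxy never saw.
novel_actions = {"shortcut": (1.0, 0.1, 1.0), "detour": (0.7, 0.3, 0.0)}

print(len(reward_set))                                # 3 candidates survive
print(best_action(proxy_w, novel_actions))            # shortcut (literal proxy)
print(risk_averse_action(novel_actions, reward_set))  # detour (worst-case rule)
```

The proxy is indifferent to the unseen feature, so it takes the shortcut. The surviving candidates disagree about that feature, and the worst-case rule avoids the action whose value depends on it. Raising `quantile` trades some of that caution for performance on the proxy.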

What this pattern forbids. The literal proxy reward must not be optimised aggressively when the agent is out of the reward's design distribution; risk-averse planning over plausible true rewards is required.

And the patterns that stand alongside it, or against it:

complements: Preference-Uncertain Agent. The agent treats its own reward/objective as a hidden variable to be inferred from human behaviour, not a fixed target.
complements: Soft-Optimization Cap. Cap how strongly the agent optimises its inferred objective: sample from the top quantile of acceptable actions rather than the argmax, or stop improving once the objective is good enough.
alternative-to: Reward Hacking. Anti-pattern: optimise the agent against a single proxy metric and assume the metric remains a faithful proxy after optimisation pressure.
complements: Confidence Reporting. Surface the agent's uncertainty about its answer alongside the answer itself.
complements: Uncertainty Neglect Bias. Anti-pattern: an agent collapses a predicted distribution to its mean and acts on the point estimate, discarding the tail, so rare extreme outcomes stay invisible to its decision and tail risk goes unmodelled.

Used in recipes

Alignment via Uncertainty (hardening): Conservative planning out of distribution; mitigates reward hacking.

Provenance

Source: patterns/risk-averse-reward-proxy.md on GitHub · commit 135ae3c
Added to catalog: 2026-05-23
Last updated: 2026-05-23
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.