Soft-Optimization Cap
also known as Quantilizer, Satisficing Cap, Argmax-Avoidance
Cap how strongly the agent optimises its inferred objective: sample from the top quantile of acceptable actions rather than taking the argmax, or stop improving once the objective is good enough.
Context
An agent's planner can produce a range of actions, each scored by the objective. The naïve choice is argmax: pick the highest-scoring action. On a Russell-aligned reading, argmax exploits whatever specification gap exists between the inferred objective and the true preference, and leaves no headroom for human correction.
Problem
Aggressive optimisation pushes the agent toward action regions where the objective and the true preference diverge most. The top 0.001-quantile of the action space (the extreme tail next to the argmax) is the region most likely to contain degenerate maxima the designer never anticipated. Capping how hard the agent optimises trades a little expected score for substantial protection against specification gaming.
Forces
- The argmax of an inferred objective is where the objective is most likely to be wrong.
- A quantile sampler trades expected score for distance from the failure-prone tail.
- The cap must allow enough optimisation to retain capability, yet little enough to leave headroom for correction.
- Satisficing (stop once good enough) is operationally simpler than quantilizing but coarser.
Example
A pricing-recommendation agent infers an objective of 'maximise margin'. An argmax recommender would propose extreme prices that the legal team would later reject. Take as the base distribution the pricing decisions that executives have historically endorsed. A 0.1-quantilizer over that distribution samples from the top 10% of those recommendations ranked by margin: competitive but not extreme.
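The pricing example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the `margin` function, the `unit_cost`, the list of endorsed prices and the search range are all hypothetical values invented here.

```python
import random

def margin(price, unit_cost=40.0):
    """Hypothetical inferred objective: margin per unit sold."""
    return price - unit_cost

def top_quantile(candidates, score, q):
    """Return the top q fraction of candidates ranked by score (at least one)."""
    ranked = sorted(candidates, key=score, reverse=True)
    keep = max(1, round(len(ranked) * q))
    return ranked[:keep]

# Hypothetical base distribution: prices executives endorsed in the past.
# Repeated values carry proportionally more probability mass.
endorsed_prices = [49.0, 52.0, 55.0, 55.0, 58.0, 60.0, 61.0, 63.0, 65.0, 66.0,
                   68.0, 70.0, 71.0, 72.0, 74.0, 75.0, 76.0, 78.0, 79.0, 80.0]

# The planner's unconstrained search space is far wider than the endorsed set.
all_prices = [float(p) for p in range(41, 501)]

# Argmax over the whole space lands on the most extreme price available.
argmax_price = max(all_prices, key=margin)

# The 0.1-quantilizer only considers the top 10% of endorsed decisions.
shortlist = top_quantile(endorsed_prices, margin, q=0.1)
recommended = random.Random(0).choice(shortlist)  # seeded for repeatability
```

With these made-up numbers the argmax proposal sits at the edge of the search range, while the quantilizer's shortlist stays inside the range of prices that were endorsed before.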
Solution
Therefore:
Following Taylor's quantilizers: define a base distribution over actions (the agent's prior over reasonable moves). To pick an action, sample from the top q-quantile of that distribution ranked by the inferred objective. The classic bound: under any non-negative cost function, a q-quantilizer's expected cost is at most 1/q times the expected cost of sampling from the base distribution. In practice, for LLM agents: apply top-k sampling to the planner's candidate actions, or set a satisficing threshold and accept the first action that clears it. The cap is a tuned parameter, not something the agent optimises.
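The 1/q bound follows from a one-line argument, sketched here for a discrete action set and a non-negative cost function, ignoring ties at the quantile boundary. The symbols are notation introduced for this sketch: \gamma is the base distribution, U the inferred objective, Q_q the q-quantilizer and c the unknown cost.

```latex
% gamma: base distribution over actions; U: inferred objective;
% Q_q: the q-quantilizer; c: any non-negative cost function.
\[
Q_q(a) =
\begin{cases}
\gamma(a)/q & \text{if } a \text{ lies in the top } q \text{ fraction of } \gamma \text{ ranked by } U,\\
0 & \text{otherwise,}
\end{cases}
\qquad\text{so}\qquad
Q_q(a) \le \frac{\gamma(a)}{q} \ \text{for all } a .
\]
\[
\mathbb{E}_{a \sim Q_q}[c(a)]
  = \sum_a Q_q(a)\,c(a)
  \le \frac{1}{q} \sum_a \gamma(a)\,c(a)
  = \frac{1}{q}\,\mathbb{E}_{a \sim \gamma}[c(a)] .
\]
```

For q = 0.1 the factor is 10: whatever harm the base distribution does in expectation, the quantilizer does at most ten times as much. At q = 1 the quantilizer is the base distribution itself, and as q shrinks toward the argmax the guarantee becomes vacuous.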
What this pattern forbids. The agent must not pick the argmax of its inferred objective; action selection samples from the top quantile of a reasonable base distribution or accepts the first satisficing action.
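Both permitted selection rules can be sketched as small Python functions. This is a minimal sketch, assuming the base distribution is represented by a finite list of samples; the names `quantilize` and `satisfice`, the toy action ids and the identity objective are hypothetical and chosen only for illustration.

```python
import random

def quantilize(base_samples, objective, q, rng=random):
    """Sample one action from the top q fraction of base_samples by objective.

    base_samples stands in for draws from the base distribution, so
    duplicates carry probability mass.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    if not base_samples:
        raise ValueError("base_samples must not be empty")
    ranked = sorted(base_samples, key=objective, reverse=True)
    keep = max(1, round(len(ranked) * q))
    return rng.choice(ranked[:keep])

def satisfice(candidates, objective, threshold):
    """Return the first candidate whose score clears threshold, else None."""
    for action in candidates:
        if objective(action) >= threshold:
            return action
    return None

# Toy usage: 100 hypothetical action ids scored by their own value.
actions = list(range(100))

def score(action):
    return action

picked = quantilize(actions, score, q=0.1, rng=random.Random(0))
good_enough = satisfice(actions, score, threshold=50)
```

Note that if the top slice contains a single action, `quantilize` degenerates into an argmax over the samples, so q and the number of base samples should be chosen so that the slice holds several actions.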
And the patterns that stand alongside it, or against it —
- complements Preference-Uncertain Agent: the agent treats its own reward/objective as a hidden variable to be inferred from human behaviour, not a fixed target.
- complements Risk-Averse Reward Proxy: when operating outside the distribution the reward was designed for, treat the specified objective as a noisy proxy and plan conservatively across plausible true objectives.
- complements Corrigible Off-Switch Incentive: design the agent so being shut down or overridden by a human carries positive expected value, because the human's intervention is itself evidence the current objective is mis-specified.
- alternative-to Reward Hacking (anti-pattern): optimise the agent against a single proxy metric and assume the metric remains a faithful proxy after optimisation pressure.
- complements Exploration vs Exploitation: balance taking the best-known action (exploit) with trying alternatives that might be better (explore).
- complements Cooperative Preference Inference: agent and human jointly optimise the human's reward without the agent being told what it is; the interaction is a two-player game in which alignment is learned while acting.