Safety & Control

Soft-Optimization Cap

Cap how strongly the agent optimises its inferred objective: sample from the top quantile of acceptable actions rather than taking the argmax, or stop improving once the objective is good enough.

Problem

Aggressive optimisation pushes the agent toward action regions where the inferred objective and the true preference diverge most. The extreme tail of the ranking (say, the top 0.001-quantile of actions, right next to the argmax) is the region most likely to contain degenerate maxima the designer never anticipated. Capping how hard the agent optimises trades a little expected score for substantial protection against specification gaming.

Solution

Following Taylor's quantilizers: define a base distribution over actions (the agent's prior over reasonable moves). To pick an action, sample from that distribution restricted to its top q-quantile, as ranked by the inferred objective. The classic bound: for any non-negative cost function, a q-quantilizer's expected cost is at most 1/q times the expected cost of sampling directly from the base distribution. In practice for LLM agents: apply top-k sampling to the planner's candidate actions, or set a satisficing threshold and accept the first action that clears it. The cap (q, k, or the threshold) is a parameter the designer tunes, not a quantity the agent optimises.
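The two selection rules above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the names quantilize and satisfice are hypothetical, the candidate list is assumed to have been sampled from the base distribution already, and score stands in for whatever inferred objective the agent uses.

```python
import math
import random
from typing import Callable, Optional, Sequence, TypeVar

A = TypeVar("A")


def quantilize(
    candidates: Sequence[A],
    score: Callable[[A], float],
    q: float,
    rng: Optional[random.Random] = None,
) -> A:
    """Pick uniformly at random from the top q-fraction of candidates.

    Assumption: `candidates` are i.i.d. samples from the base distribution,
    so a uniform pick among the best ceil(q * n) of them approximates
    sampling from the base distribution's top q-quantile.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    rng = rng or random.Random()
    # Rank by the inferred objective, best first.
    ranked = sorted(candidates, key=score, reverse=True)
    # Always keep at least one action; q = 1 keeps the whole sample.
    keep = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:keep])


def satisfice(
    candidates: Sequence[A],
    score: Callable[[A], float],
    threshold: float,
) -> Optional[A]:
    """Return the first candidate whose score clears the threshold, else None."""
    for action in candidates:
        if score(action) >= threshold:
            return action
    return None
```

With q = 1 the quantilizer simply samples from the base distribution, and as q shrinks toward 1/n it approaches the argmax over the sample, so q is the dial between imitation and optimisation. The 1/q bound reflects the fact that no single action is made more than 1/q times as likely as it was under the base distribution.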

When to use

  • The agent's inferred objective is plausibly mis-specified at the tail.
  • A reasonable base distribution of human-endorsed actions exists.
  • Some loss of expected score is acceptable in exchange for tail safety.
