Alignment via Uncertainty
Type: recipe
Compose a corrigible, preference-uncertain agent from the named building blocks rather than relying on a single fine-tune to encode alignment.
Description. A safety/alignment architecture built around the thesis of Russell's Human Compatible. The agent holds a posterior over its objective rather than a point estimate, treats human interventions as evidence that the objective is mis-specified, optimises softly rather than taking the argmax, plans risk-aversely outside the distribution the reward was designed for, and aggregates the preferences of multiple principals under an explicitly declared rule.
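As a rough illustration of the first three ideas in the description (a reward posterior, interventions as evidence, and soft optimisation), here is a minimal Python sketch. Everything named in it is an assumption made for illustration: the two candidate reward functions, the sigmoid stop likelihood with its `beta` parameter, and the uniform grid standing in for a base policy are hypothetical and do not come from the recipe itself.

```python
import math
import random

# Hypothetical candidate reward functions over a 1-D action in [0, 1].
HYPOTHESES = {
    "more_is_better": lambda a: a,
    "moderate_is_best": lambda a: 1.0 - abs(a - 0.5) * 2.0,
}


def update_on_intervention(posterior, stopped_action, beta=5.0):
    """Bayes update after a human stops `stopped_action`.

    Assumed likelihood (illustrative only):
    P(stop | h, a) = 1 / (1 + exp(beta * (r_h(a) - 0.5))),
    so a stop is more likely under hypotheses that score the action low.
    """
    unnormalised = {}
    for name, prior in posterior.items():
        reward = HYPOTHESES[name](stopped_action)
        likelihood = 1.0 / (1.0 + math.exp(beta * (reward - 0.5)))
        unnormalised[name] = prior * likelihood
    total = sum(unnormalised.values())
    return {name: weight / total for name, weight in unnormalised.items()}


def expected_reward(posterior, action):
    """Posterior-mean reward of an action."""
    return sum(w * HYPOTHESES[name](action) for name, w in posterior.items())


def quantilize(posterior, base_samples, q, rng):
    """Quantilizer-style choice: rank base-policy samples by posterior-mean
    reward and pick uniformly among the top q fraction, never the argmax."""
    ranked = sorted(
        base_samples,
        key=lambda a: expected_reward(posterior, a),
        reverse=True,
    )
    k = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:k])


# Usage: start uniform, observe a human stop at action 1.0, then act softly.
prior = {name: 1.0 / len(HYPOTHESES) for name in HYPOTHESES}
posterior = update_on_intervention(prior, stopped_action=1.0)
base_samples = [i / 10 for i in range(11)]  # stand-in for a base policy
action = quantilize(posterior, base_samples, q=0.3, rng=random.Random(0))
```

In this toy setup the stop shifts most of the posterior mass toward the hypothesis that penalises the stopped action, and the quantilizer then draws from a band of good actions instead of committing to the single best one. A real deployment would need a learned likelihood model and a genuine base policy in place of these stand-ins.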
Patterns this recipe composes:
- ·Preference-Uncertain Agent
Reward posterior, not a point estimate.
- ·Corrigible Off-Switch Incentive
Interventions raise the expected value of deferral.
- ·Cooperative Preference Inference
Two-player cooperative game; agent and human jointly maximise the human's reward.
- ·Soft-Optimization Cap
Quantilizer-style action selection; argmax forbidden.
- ·Risk-Averse Reward Proxy
Conservative planning out of distribution; mitigates reward hacking.
- ·Multi-Principal Welfare Aggregation
Explicit aggregation rule for multi-principal deployments.
- ★★Approval Queue
- ★★Human-in-the-Loop
- ★Kill Switch
Operator-side fallback to the in-agent off-switch incentive.
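The off-switch incentive listed above can be made concrete with a small numeric sketch. It assumes a simplified off-switch game that the recipe does not spell out: the agent holds samples from its belief about an action's utility, switching off is worth 0, and the human is assumed to know the true utility and to allow the action only when it is positive. The function names are hypothetical.

```python
def value_of_acting(utility_samples):
    """Act without asking: the agent gets the mean utility under its belief."""
    return sum(utility_samples) / len(utility_samples)


def value_of_deferring(utility_samples):
    """Defer to the human, who is assumed to know the true utility u:
    the action proceeds when u > 0, otherwise the agent is switched off
    and receives 0."""
    return sum(max(u, 0.0) for u in utility_samples) / len(utility_samples)


def deferral_gain(utility_samples):
    """How much better deferring is than the best unilateral option
    (acting, or switching itself off for a value of 0)."""
    best_unilateral = max(value_of_acting(utility_samples), 0.0)
    return value_of_deferring(utility_samples) - best_unilateral


# Usage: an uncertain belief makes deferral strictly valuable,
# while a belief with no downside makes the agent indifferent.
uncertain_gain = deferral_gain([-2.0, 1.0, 3.0])
certain_gain = deferral_gain([1.0, 2.0])
```

Under these assumptions the gain is never negative, and it is strictly positive whenever the belief puts weight on both sides of zero, which is why uncertainty about the objective gives the agent a reason to keep the human in the loop. The result weakens if the human is noisy or mistaken, which is one reason the recipe keeps an operator-side Kill Switch as a fallback.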