Alignment via Uncertainty

Type: recipe

Compose a corrigible, preference-uncertain agent from the building blocks listed below, rather than relying on a single fine-tune to encode alignment.

Description. A safety/alignment architecture built around the thesis of Stuart Russell's Human Compatible. The agent holds a posterior over its objective rather than a point estimate, treats human interventions as evidence that the objective is mis-specified, optimises softly rather than by argmax, plans risk-aversely outside the distribution the reward was designed for, and aggregates the preferences of multiple principals under an explicitly declared rule.
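As a rough illustration of the first two ideas (a posterior over the objective, and interventions treated as evidence), here is a minimal Python sketch. Everything in it is an assumption made for the example: the two reward hypotheses, the action names, the logistic model of when a human intervenes, and the idealised human in `value_of_deferring`. It is not taken from the recipe or from any particular library.

```python
import math

# Hypothetical reward hypotheses: each maps an action to a reward.
REWARD_HYPOTHESES = {
    "likes_quiet": {"tidy": 1.0, "cook": 0.5, "vacuum_at_night": -2.0},
    "likes_clean": {"tidy": 1.0, "cook": 0.5, "vacuum_at_night": 2.0},
}


def normalise(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def p_intervene(reward, beta=2.0):
    """Assumed human model: the lower the reward, the likelier a stop."""
    return 1.0 / (1.0 + math.exp(beta * reward))


def update_on_intervention(posterior, action, intervened, beta=2.0):
    """Bayesian update of the reward posterior after observing the human."""
    unnormalised = {}
    for name, prob in posterior.items():
        stop = p_intervene(REWARD_HYPOTHESES[name][action], beta)
        likelihood = stop if intervened else 1.0 - stop
        unnormalised[name] = prob * likelihood
    return normalise(unnormalised)


def expected_reward(posterior, action):
    """Expected reward of acting directly, averaged over the posterior."""
    return sum(p * REWARD_HYPOTHESES[h][action] for h, p in posterior.items())


def value_of_deferring(posterior, action):
    """Value of asking first, assuming an idealised human who permits
    the action only when its true reward is positive."""
    return sum(
        p * max(REWARD_HYPOTHESES[h][action], 0.0) for h, p in posterior.items()
    )


prior = {"likes_quiet": 0.5, "likes_clean": 0.5}
posterior = update_on_intervention(prior, "vacuum_at_night", intervened=True)
```

Under these assumptions, a single intervention on `vacuum_at_night` moves the weight on `likes_quiet` from 0.5 to about 0.98, and the expected reward of that action turns negative. Under the prior, `value_of_deferring` is at least as large as acting directly, which is the intuition behind the off-switch incentive; the guarantee weakens as the human model departs from the idealised one.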

Patterns this recipe composes:

· Preference-Uncertain Agent (core): Reward posterior, not a point estimate.
· Corrigible Off-Switch Incentive (core): Interventions raise expected value of deferral.
· Cooperative Preference Inference (core): Joint two-player game; agent and human jointly maximise the human's reward.
· Soft-Optimization Cap (hardening): Quantilizer-style action selection; argmax forbidden.
· Risk-Averse Reward Proxy (hardening): Conservative planning out of distribution; mitigates reward hacking.
· Multi-Principal Welfare Aggregation (hardening): Explicit aggregation rule for multi-principal deployments.
· ★★ Approval Queue (optional)
· ★★ Human-in-the-Loop (optional)
· ★ Kill Switch (optional): Operator-side fallback to the in-agent off-switch incentive.
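The three hardening patterns above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the recipe's implementation: `quantilize` assumes a uniform base distribution over a finite action set, `cvar` uses the mean of the worst `alpha` fraction of sampled rewards as one possible risk-averse score, and the `AGGREGATORS` table with its two rules is a hypothetical way of forcing the aggregation rule to be declared by name.

```python
import math
import random


def quantilize(actions, utility, q, rng):
    """Quantilizer-style selection: sample uniformly from the top-q
    fraction of actions ranked by estimated utility.
    Assumes a uniform base distribution over a finite action list."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:k])


def cvar(samples, alpha):
    """Risk-averse score: mean of the worst alpha fraction of samples."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(samples)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


# Hypothetical registry: the deployment must name one of these rules.
AGGREGATORS = {
    "utilitarian": lambda utilities: sum(utilities) / len(utilities),
    "maximin": min,
}


def aggregate(utilities, rule):
    """Combine per-principal utilities under an explicitly declared rule."""
    if rule not in AGGREGATORS:
        raise KeyError(f"undeclared aggregation rule: {rule}")
    return AGGREGATORS[rule](utilities)


rng = random.Random(0)
choice = quantilize(list(range(10)), utility=lambda a: a, q=0.2, rng=rng)
worst_case = cvar([1.0, 2.0, 3.0, -10.0], alpha=0.25)
fair_share = aggregate([1.0, 3.0], rule="maximin")
```

A design note on `q`: when `q` is so small that only one action survives the cut, `quantilize` degenerates into argmax, so a deployment following this recipe would presumably keep `q` well above `1 / len(actions)`.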

Provenance

Last updated: 2026-05-23