Alignment via Uncertainty

Compose a corrigible, preference-uncertain agent from the named building blocks rather than relying on a single fine-tune to encode alignment.

Description

A safety/alignment architecture built around the thesis of Stuart Russell's Human Compatible. The agent holds a posterior over its objective rather than a point estimate, treats human interventions as evidence that the objective is mis-specified, optimises softly rather than taking an argmax, plans risk-aversely outside the distribution the reward was designed for, and aggregates the preferences of multiple principals under an explicitly declared rule.
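As a rough illustration of the first three ingredients (a posterior over objectives, interventions as evidence, soft optimisation), here is a minimal sketch in Python. Everything in it is an assumption made for the example: the two candidate objectives, the three actions, the reward numbers, the logistic veto likelihood, and the names `soft_policy` and `update_on_veto`. Risk-averse planning and multi-principal aggregation are left out to keep it short.

```python
import math

# Hypothetical candidate objectives: each maps an action to a reward.
REWARDS = {
    "speed":  {"fast": 1.0,  "careful": 0.2, "wait": 0.0},
    "safety": {"fast": -1.0, "careful": 0.8, "wait": 0.1},
}
ACTIONS = ["fast", "careful", "wait"]


def expected_rewards(posterior):
    """Posterior-mean reward of each action."""
    return {
        a: sum(p * REWARDS[h][a] for h, p in posterior.items())
        for a in ACTIONS
    }


def soft_policy(posterior, temperature=0.5):
    """Boltzmann (softmax) distribution over actions, not an argmax."""
    exp_r = expected_rewards(posterior)
    top = max(exp_r.values())  # subtract the max for numerical stability
    weights = {a: math.exp((r - top) / temperature) for a, r in exp_r.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def update_on_veto(posterior, vetoed_action, rationality=2.0):
    """Bayes update after a human blocks `vetoed_action`.

    Assumed likelihood (illustrative only): under objective h the human
    vetoes with probability sigmoid(-rationality * reward_h(action)),
    so a veto is more likely the worse the action is under h.
    """
    def veto_likelihood(h):
        return 1.0 / (1.0 + math.exp(rationality * REWARDS[h][vetoed_action]))

    unnormalised = {h: p * veto_likelihood(h) for h, p in posterior.items()}
    total = sum(unnormalised.values())
    return {h: u / total for h, u in unnormalised.items()}


prior = {"speed": 0.5, "safety": 0.5}
before = soft_policy(prior)

# The human stops the agent from acting "fast": evidence about the objective.
posterior = update_on_veto(prior, "fast")
after = soft_policy(posterior)
```

After the veto, the posterior shifts toward the "safety" hypothesis and the soft policy puts less probability on "fast". No action's probability drops to exactly zero, which is the practical difference from an argmax planner.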
