
Alignment via Uncertainty

Type: recipe

Compose a corrigible, preference-uncertain agent from the named building blocks rather than relying on a single fine-tune to encode alignment.

Description. A safety and alignment architecture built around the thesis of Stuart Russell's Human Compatible: the agent holds a posterior over its objective rather than a point estimate, treats human interventions as evidence that the objective is mis-specified, optimises softly rather than by argmax, plans risk-aversely outside the distribution the reward was designed for, and aggregates the preferences of multiple principals under an explicitly declared rule.
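A minimal sketch of three of these ingredients (a posterior over objectives, a Bayesian update on a human intervention, and soft action selection) might look like the following. Everything named here is an assumption made for illustration and is not part of the recipe: the two-hypothesis reward table, the three actions (proceed, ask, stop), the Boltzmann-rational model of the human, and the temperature values. Risk-averse planning and multi-principal aggregation are not covered.

```python
import math


def softmax(values, beta):
    """Boltzmann distribution over `values` with inverse temperature `beta`."""
    m = max(values)  # subtract the max for numerical stability
    weights = [math.exp(beta * (v - m)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def expected_rewards(posterior, reward_table):
    """Posterior-mean reward per action; reward_table[h][a] is the reward
    of action a under objective hypothesis h."""
    n_actions = len(reward_table[0])
    return [
        sum(p * row[a] for p, row in zip(posterior, reward_table))
        for a in range(n_actions)
    ]


def update_on_human_choice(posterior, reward_table, human_action, beta_human):
    """Bayes update that treats the human's choice as evidence about the
    objective, assuming (hypothetically) a Boltzmann-rational human."""
    likelihoods = [softmax(row, beta_human)[human_action] for row in reward_table]
    unnormalised = [p * lik for p, lik in zip(posterior, likelihoods)]
    z = sum(unnormalised)
    return [u / z for u in unnormalised]


# Hypothetical toy setup. Actions are indexed [proceed, ask, stop].
PROCEED, ASK, STOP = 0, 1, 2
REWARD_TABLE = [
    [1.0, 0.5, 0.0],   # hypothesis 0: the objective is as specified
    [-2.0, 0.5, 1.0],  # hypothesis 1: the objective is mis-specified
]

prior = [0.9, 0.1]  # the agent starts out fairly sure of its objective
policy_before = softmax(expected_rewards(prior, REWARD_TABLE), beta=2.0)

# The human intervenes by choosing "stop"; the agent updates its beliefs
# about the objective instead of resisting the intervention.
posterior = update_on_human_choice(prior, REWARD_TABLE, STOP, beta_human=3.0)
policy_after = softmax(expected_rewards(posterior, REWARD_TABLE), beta=2.0)
```

In this toy setup the intervention moves probability mass toward the mis-specified hypothesis, so the soft policy shifts away from proceeding. Because actions are sampled from a softmax rather than chosen by argmax, every action keeps non-zero probability before and after the update.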

Patterns this recipe composes

Provenance

  • Last updated: