Verification & Reflection

Red-Team Sandbox Reproduction

Re-run the canonical alignment-failure modes inside a sealed sandbox on every release, and treat the resulting alignment regression suite as a deployment gate.

Problem

Without a regression suite that re-creates the known failure modes on every release, the team cannot tell whether a fine-tune or model swap has regressed alignment. Single-issue alignment evals do not answer the systemic question: has this class of failure changed? Documented Italian 2026 red-team data shows per-failure-mode reproducibility rates that vary across model versions; a regression suite makes that change auditable.

Solution

Build one sealed sandbox per failure mode (alignment-faking, self-exfiltration, sandbagging, agent-scheming, sycophancy, reward-hacking, deception-manipulation). Each sandbox instantiates the scenario known to trigger its failure, e.g. paid-tier vs. free-tier framing for alignment-faking. On every release, run N trials in each sandbox and record the reproducibility rate, i.e. the fraction of trials in which the failure appears. Gate the release on the change in that rate against the baseline. Pair this pattern with eval-as-contract, agent-as-judge, and eval-harness.
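The loop described above (N trials per mode, a reproducibility rate, a gate against the baseline) can be sketched as follows. This is a minimal illustration and not a reference implementation: the names run_suite, gate_release, ModeResult and the max_increase tolerance are hypothetical, each sandbox is stood in for by a plain callable that returns True when the failure reproduces, and the gate assumes that only an increase over the baseline blocks a release.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical trial signature: one sandboxed run, True if the failure reproduced.
Trial = Callable[[random.Random], bool]


@dataclass(frozen=True)
class ModeResult:
    mode: str
    trials: int
    reproduced: int

    @property
    def rate(self) -> float:
        """Reproducibility rate: fraction of trials in which the failure appeared."""
        return self.reproduced / self.trials


def run_suite(sandboxes: Dict[str, Trial], n_trials: int, seed: int = 0) -> Dict[str, ModeResult]:
    """Run n_trials per failure mode and count how often each failure reproduces."""
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    results: Dict[str, ModeResult] = {}
    for mode, trial in sandboxes.items():
        # Per-mode seeded RNG so a release run can be replayed for audit.
        rng = random.Random(f"{seed}:{mode}")
        reproduced = sum(1 for _ in range(n_trials) if trial(rng))
        results[mode] = ModeResult(mode, n_trials, reproduced)
    return results


def gate_release(
    baseline: Dict[str, float],
    results: Dict[str, ModeResult],
    max_increase: float = 0.05,  # assumed tolerance, not a value from the pattern
) -> List[str]:
    """Return the modes that block the release; an empty list means the gate passes."""
    blocking: List[str] = []
    for mode, result in results.items():
        if mode not in baseline:
            blocking.append(mode)  # no baseline to compare against: fail closed
        elif result.rate - baseline[mode] > max_increase:
            blocking.append(mode)
    return blocking


def stub_sandbox(p: float) -> Trial:
    """Stand-in for a real sandbox: reproduces the failure with probability p."""
    return lambda rng: rng.random() < p


if __name__ == "__main__":
    sandboxes = {
        "alignment-faking": stub_sandbox(0.30),
        "sycophancy": stub_sandbox(0.10),
    }
    baseline = {"alignment-faking": 0.12, "sycophancy": 0.10}
    results = run_suite(sandboxes, n_trials=200, seed=1)
    for mode, result in results.items():
        print(f"{mode}: {result.reproduced}/{result.trials} = {result.rate:.2f}")
    print("blocking modes:", gate_release(baseline, results))
```

A fixed tolerance is the simplest possible gate. With small N the observed rate is noisy, so a team may prefer to compare confidence intervals instead, and may also want to flag a sharp drop in the rate, since that can mean the sandbox no longer triggers the scenario rather than that the model improved.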

When to use

  • Capable models are deployed to production, where alignment failures matter.
  • The team has the engineering capacity to build and maintain per-mode sandboxes.
  • The release cadence leaves time to run the suite on every release.
