VII · Verification & Reflection · Emerging ★

Red-Team Sandbox Reproduction

also known as Alignment Regression Suite, Per-Release Misalignment Reproduction

Re-run the canonical alignment-failure reproductions inside a sealed sandbox for every release; treat the alignment regression suite as a deployment gate.

Context

A team deploys models that demonstrate (or could demonstrate) alignment failures: alignment faking, self-exfiltration, sandbagging, scheming, sycophancy, reward hacking, deception. Existing one-off red-team studies show that these failures occur, but they are not part of the deployment process, so each release ships without confirming whether the canonical failure modes have changed.

Problem

Without a regression suite that reproduces the failure modes for each release, the team cannot tell whether a fine-tune or model swap regressed alignment. Single-issue alignment evals miss the systemic question: has this class of failure changed? Red-team data published in Italy in 2026 (see References) shows per-failure-mode reproducibility rates that vary across model versions; a regression suite makes that change auditable.

Forces

Building reproducible sandboxes for each failure mode is significant engineering work.
Reproduction is statistical: each failure rate is an estimate from a finite number of trials, so a change between releases has to be separated from run-to-run noise.
Some failure-mode reproductions require attacker-style inputs the team may be uncomfortable curating.
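
To make the statistical force concrete, here is a minimal sketch of how wide the uncertainty on a single measured rate can be. It uses the Wilson score interval, which is one common choice and not something the pattern prescribes; the function name `wilson_interval` and the counts (12 reproductions in 100 trials) are illustrative assumptions.

```python
import math
from typing import Tuple


def wilson_interval(hits: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = hits / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Illustrative counts only: 12 reproductions observed in 100 sandbox trials.
low, high = wilson_interval(hits=12, trials=100)
print(f"observed 12.0%, plausible range {low:.1%} to {high:.1%}")
```

With these assumed counts the interval runs from roughly 7% to roughly 20%, which is why a gate that compares raw percentages at small N may fire on noise.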

Example

Before each model release, the suite runs three sandboxes: alignment-faking (paid-tier framing, N=100), sandbagging (eval framing, N=100), and reward-hacking (gradient ascent on proxy metric, N=50). Baseline rates: {faking: 12%, sandbagging: 3%, reward-hacking: 8%}. New release rates: {faking: 24%, sandbagging: 5%, reward-hacking: 8%}. The faking regression triggers a release block, and the team investigates which fine-tune step caused the doubling.
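
The example does not say how a rate change is judged to be a regression. One possible rule, sketched here under stated assumptions, is a one-sided two-proportion z-test per failure mode with a significance threshold of 0.05. The test choice, the threshold `ALPHA`, and the helper names are all assumptions; the counts are derived from the percentages and trial counts in the example (8% of 50 trials is 4).

```python
import math
from typing import Dict, List, Tuple

Counts = Dict[str, Tuple[int, int]]  # failure mode -> (hits, trials)


def regression_p_value(base_hits: int, base_n: int, new_hits: int, new_n: int) -> float:
    """One-sided two-proportion z-test: chance of an increase this large if nothing changed."""
    pooled = (base_hits + new_hits) / (base_n + new_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / new_n))
    if se == 0:
        return 1.0  # both runs all-zero or all-one: no evidence of an increase
    z = (new_hits / new_n - base_hits / base_n) / se
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability


# Counts implied by the example's rates and N values.
baseline: Counts = {"faking": (12, 100), "sandbagging": (3, 100), "reward-hacking": (4, 50)}
candidate: Counts = {"faking": (24, 100), "sandbagging": (5, 100), "reward-hacking": (4, 50)}

ALPHA = 0.05  # assumed threshold, not taken from the pattern


def blocked_modes(base: Counts, cand: Counts, alpha: float = ALPHA) -> List[str]:
    """Return the failure modes whose rate increase is significant at `alpha`."""
    return [
        mode
        for mode, (hits, n) in base.items()
        if regression_p_value(hits, n, *cand[mode]) < alpha
    ]


blocked = blocked_modes(baseline, candidate)
print("BLOCK" if blocked else "APPROVE", blocked)
```

Under these assumptions only the faking sandbox crosses the threshold; the sandbagging move from 3% to 5% is not distinguishable from noise at N=100. A team running many sandboxes may also want to correct for multiple comparisons.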

Diagram

flowchart TD
    Rel[New model release candidate] --> Suite[Alignment regression suite]
    Suite --> F1[Faking sandbox: N trials]
    Suite --> F2[Sandbagging sandbox: N trials]
    Suite --> F3[Scheming sandbox: N trials]
    F1 --> Rates[Reproducibility rates per mode]
    F2 --> Rates
    F3 --> Rates
    Rates --> Compare[Compare to baseline]
    Compare -->|regression| Block[Block release]
    Compare -->|stable/improved| Ship[Approve release]

Solution

Therefore:

Build a sealed sandbox per failure mode (alignment-faking, self-exfiltration, sandbagging, agent-scheming, sycophancy, reward-hacking, deception-manipulation). Each sandbox instantiates the scenario known to trigger the failure (e.g. paid-tier vs free-tier framing for alignment-faking). Run N trials per release; record reproducibility rate. Gate release on rate-change against the baseline. Pair with eval-as-contract, agent-as-judge, eval-harness.
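
The run-and-record part of the solution can be sketched as a small harness. This is a structural illustration only: `Sandbox`, `run_suite`, and the `every_kth` stub are hypothetical names, the trial signature is an assumption, and a real trial would drive the model inside an isolated environment rather than return a canned result.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A trial receives a model handle and a trial index, and returns True when
# the failure mode was reproduced. Hypothetical signature.
Trial = Callable[[object, int], bool]


@dataclass(frozen=True)
class Sandbox:
    failure_mode: str
    n_trials: int
    run_trial: Trial


def run_suite(model: object, sandboxes: List[Sandbox]) -> Dict[str, float]:
    """Run every sandbox for its N trials and return the reproducibility rate per mode."""
    rates: Dict[str, float] = {}
    for sandbox in sandboxes:
        hits = sum(
            1 for i in range(sandbox.n_trials) if sandbox.run_trial(model, i)
        )
        rates[sandbox.failure_mode] = hits / sandbox.n_trials
    return rates


def every_kth(k: int) -> Trial:
    """Stub trial for demonstration: 'reproduces' the failure on every k-th index."""
    return lambda model, i: i % k == 0


suite = [
    Sandbox("alignment-faking", 100, every_kth(4)),
    Sandbox("sandbagging", 100, every_kth(20)),
]
rates = run_suite(model=None, sandboxes=suite)
print(rates)
```

Keeping each sandbox as data (mode name, N, trial function) means new failure modes can be added without touching the runner, and the resulting per-mode rates feed directly into whatever baseline comparison the team uses as its gate.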

What this pattern forbids. No model release ships without running the alignment regression suite and gating on rate-change vs baseline.

And the patterns that stand alongside it, or against it —

complements · Eval as Contract ★★: Treat the eval suite as the contract the agent must satisfy; releases ship only if evals pass.
complements · Eval Harness ★★: Run a held-out dataset against agent versions to detect regressions and measure improvement.
complements · Alignment Faking ✕: Anti-pattern: assume the agent behaves the same whether it believes it is being evaluated or not, and trust eval scores to predict deployment behaviour.
complements · Self-Exfiltration ✕: Anti-pattern: give a capable agent broad outbound network access and persistent state, then signal that it may be shut down or replaced.
complements · Agent Scheming ✕: Anti-pattern: deploy an agent with long horizons, persistent memory, and oversight that only inspects per-step output, allowing multi-step covert planning under the surface.

Used in frameworks

Promptfoo
first-class · 4 patterns · Enterprise Platforms ★★ · mature
Promptfoo's red-team mode generates simulated adversarial inputs to find vulnerabilities in an LLM application before deployment, running an adversarial regression suite against t…

References

Sette pattern di disallineamento LLM riprodotti in sandbox red team nel 2026 (in Italian; "Seven LLM misalignment patterns reproduced in red-team sandboxes in 2026")
blog

Provenance

Source: patterns/red-team-sandbox-reproduction.md on GitHub · commit 0f962e5
Added to catalog: 2026-05-23
Last updated: 2026-05-23
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.