Governance & Observability

Determinism-Tiered Replay Gate

Classify an agent into a reproducibility tier by re-running identical inputs, require at least the decision-determinism tier for regulated decisions, and gate deployment and validation-sample size on the measured tier.

Problem

Replay being mechanically possible does not mean a re-run converges. Sampling temperature, tool-ordering races, clock and retrieval drift, and model-version changes all let two runs of identical inputs diverge, and the divergence may stop at the reasoning trace or reach the decision itself. Treating all agents as equally reproducible lets one whose final decision flips on re-run pass the same governance bar as one that is bit-for-bit stable, so a regulator who replays a logged case can get a different answer than the customer received, with no prior signal that this was possible.

Solution

Define an ordered ladder of reproducibility tiers, each measured by paired re-runs on identical inputs: trace determinism (the same tool sequence and arguments), action determinism (the same tool sequence with arguments allowed to vary within tolerance), and decision determinism (the same final decision regardless of path). A determinism harness re-runs a held-out sample of logged cases, compares each re-run against the original at every tier, and reports the strictest tier the agent satisfies above a confidence threshold. A gate maps the measured tier to a release decision: regulated decisions are admitted only at the decision-determinism tier or stricter, while advisory or internal uses may ship at weaker tiers. The required validation-sample size and monitoring frequency both increase as the measured tier weakens, so a weaker agent must clear a larger sample and a tighter drift watch. The measured tier is recorded as a re-certifiable assurance artifact and re-measured on every model or tool change, since either can silently drop the agent to a lower tier.
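
The harness and gate described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names `Tier`, `ToolCall`, `Run`, `tier_of_pair`, `measure_tier`, `gate` and `MIN_SAMPLE` are hypothetical, the "confidence threshold" is read as a simple match rate over the replayed pairs, the ladder is treated as nested (a pair counts at a tier only if it also yields the same final decision), and the sample-size numbers are placeholders rather than recommended values.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Assumed ordering: a higher value is a stricter tier."""
    NONE = 0      # final decision flipped on re-run
    DECISION = 1  # same final decision, path may differ
    ACTION = 2    # same tool sequence, arguments within tolerance
    TRACE = 3     # same tool sequence and identical arguments


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: dict


@dataclass(frozen=True)
class Run:
    calls: tuple
    decision: str


def args_close(a: dict, b: dict, tol: float) -> bool:
    """Same keys; numeric values within tol, everything else equal."""
    if a.keys() != b.keys():
        return False
    for key in a:
        x, y = a[key], b[key]
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            if abs(x - y) > tol:
                return False
        elif x != y:
            return False
    return True


def tier_of_pair(orig: Run, rerun: Run, tol: float = 1e-6) -> Tier:
    """Strictest tier at which one re-run matches its original."""
    if orig.decision != rerun.decision:
        return Tier.NONE
    if [c.name for c in orig.calls] != [c.name for c in rerun.calls]:
        return Tier.DECISION
    pairs = list(zip(orig.calls, rerun.calls))
    if all(c.args == r.args for c, r in pairs):
        return Tier.TRACE
    if all(args_close(c.args, r.args, tol) for c, r in pairs):
        return Tier.ACTION
    return Tier.DECISION


def measure_tier(pairs, threshold: float = 0.99, tol: float = 1e-6) -> Tier:
    """Strictest tier whose match rate over all pairs meets the threshold."""
    if not pairs:
        raise ValueError("need at least one (original, re-run) pair")
    levels = [tier_of_pair(o, r, tol) for o, r in pairs]
    for tier in (Tier.TRACE, Tier.ACTION, Tier.DECISION):
        if sum(lv >= tier for lv in levels) / len(levels) >= threshold:
            return tier
    return Tier.NONE


# Placeholder policy: the weaker the tier, the larger the validation sample.
MIN_SAMPLE = {Tier.TRACE: 100, Tier.ACTION: 300,
              Tier.DECISION: 1000, Tier.NONE: 3000}


def gate(measured: Tier, use: str, sample_size: int) -> tuple:
    """Return (admitted, reason) for a use of 'regulated' or anything else."""
    required = Tier.DECISION if use == "regulated" else Tier.NONE
    if measured < required:
        return (False, f"{use} use requires {required.name}, got {measured.name}")
    if sample_size < MIN_SAMPLE[measured]:
        return (False, f"{measured.name} needs {MIN_SAMPLE[measured]} validated cases")
    return (True, "admitted")
```

In this sketch the measured tier would be recomputed with `measure_tier` after every model or tool change and the result stored alongside the release record, so that a drop from, say, `ACTION` to `NONE` blocks a regulated deployment at the next `gate` call. A production harness would likely replace the raw match rate with a statistical lower bound on it, which this example does not attempt.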

When to use

  • An agent makes decisions a regulator or auditor may later replay and expect the same conclusion.
  • Replay machinery exists but the agent's actual reproducibility under re-run has never been measured.
  • Different uses warrant different reproducibility bars, so a single deployment gate cannot apply one rule to all of them.
