X · Governance & Observability · Experimental

Determinism-Tiered Replay Gate

also known as Decision-Determinism Gate, Graded Reproducibility Gate, Replay-Tier Deployment Gate

Classify an agent into a reproducibility tier by re-running identical inputs, require decision determinism for regulated decisions, and gate both deployment and validation-sample size on the measured tier.

Context

A tool-using agent makes decisions that an auditor or regulator may later re-examine, such as approving a transaction, scoring a credit application, or filing a report. Replay machinery already exists: inputs, prompts, model ids, and tool calls are captured so a past run can be re-executed. What is missing is a statement of how reproducible the agent actually is. Re-running the same inputs can yield an identical tool sequence, the same sequence with drifting arguments, or merely the same final decision by a different path, and nobody has measured which.

Problem

Replay being mechanically possible does not mean a re-run converges. Sampling temperature, tool-ordering races, clock and retrieval drift, and model-version changes all let two runs of identical inputs diverge, and the divergence may stop at the reasoning trace or reach the decision itself. Treating all agents as equally reproducible lets one whose final decision flips on re-run pass the same governance bar as one that is bit-for-bit stable, so a regulator who replays a logged case can get a different answer than the customer received, with no prior signal that this was possible.

Forces

An auditor cares that the agent reaches the same conclusion on re-run, even when the internal reasoning path legitimately varies, so trace-level reproducibility is stronger than compliance needs.
Sampling and tool concurrency that raise answer quality also lower reproducibility, so the determinism tier trades against capability rather than being free.
A less reproducible agent needs a larger validation sample and tighter monitoring to bound its decision-flip rate, so the assurance cost rises as the tier weakens.
Measuring the tier requires many paired re-runs, which costs compute and must be repeated whenever the model or tools change.
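
The cost of measurement in the last two forces can be made concrete with a small calculation. The sketch below estimates how many paired re-runs are needed before zero observed decision flips supports a claimed upper bound on the flip rate. It assumes re-runs are independent and identically distributed, which logged cases may not satisfy; the function name and the default confidence level are illustrative, not part of the pattern.

```python
import math


def min_pairs_for_flip_bound(max_flip_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that zero flips in n paired re-runs supports the claim
    'flip rate <= max_flip_rate' at the given confidence.

    Solves (1 - p)^n <= 1 - confidence for n, assuming independent re-runs.
    """
    if not 0.0 < max_flip_rate < 1.0:
        raise ValueError("max_flip_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - max_flip_rate))


# Bounding the flip rate at 1 percent needs far fewer pairs than bounding it at 0.1 percent.
pairs_for_one_percent = min_pairs_for_flip_bound(0.01)
pairs_for_tenth_percent = min_pairs_for_flip_bound(0.001)
```

The required sample grows roughly in proportion to 1 / max_flip_rate, which is why a tighter bound on decision flips carries a noticeably larger compute bill.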

Example

A bank runs an agent that approves or declines small business loans. Before launch, a determinism harness replays two thousand logged applications twice each and finds that the agent gives the same approve/decline outcome on 99.4 percent of re-runs but rarely the same tool sequence. It is classified at the decision-determinism tier, so it clears the gate for regulated decisions. A sister agent that summarises calls reaches action determinism but not decision determinism, so it is released only for internal drafting, with a larger validation sample and weekly drift checks.
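
An observed agreement rate such as 99.4 percent is a point estimate, so a gate would more plausibly compare a lower confidence bound against its threshold. The sketch below uses the Wilson score interval for that purpose. It assumes the example amounts to 2,000 paired comparisons of which 1,988 agree; the text does not say whether the percentage is taken over 2,000 or 4,000 re-runs, so these counts are an illustrative reading.

```python
import math


def wilson_lower_bound(agreements: int, pairs: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for an agreement proportion.

    z = 1.96 corresponds to a two-sided 95 percent interval.
    """
    if pairs <= 0 or not 0 <= agreements <= pairs:
        raise ValueError("need pairs > 0 and 0 <= agreements <= pairs")
    p_hat = agreements / pairs
    z2 = z * z
    centre = p_hat + z2 / (2 * pairs)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / pairs + z2 / (4 * pairs**2))
    return (centre - margin) / (1 + z2 / pairs)


# Assumed reading of the example: 1,988 of 2,000 paired re-runs agree on the decision.
lower = wilson_lower_bound(1988, 2000)
```

Under these assumed counts the lower bound falls just below 99 percent, so whether the agent clears the gate depends on where the threshold is set, not only on the headline rate.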

Diagram

flowchart TD
    L[Logged cases] -->|sample| H[Determinism harness]
    H -->|paired re-run| C{Tier classifier}
    C -->|same trace| T[Trace determinism]
    C -->|same actions| A[Action determinism]
    C -->|same decision| D[Decision determinism]
    T --> G{Deployment gate}
    A --> G
    D --> G
    G -->|tier >= required| OK[Admit + scale validation by tier]
    G -->|tier too low| NO[Block regulated use]
    OK --> R[Re-certifiable tier artifact]

Solution

Therefore:

Define an ordered ladder of reproducibility tiers measured by paired re-runs on identical inputs: trace determinism (the same tool sequence and arguments), action determinism (the same tool sequence with arguments allowed to vary within tolerance), and decision determinism (the same final decision regardless of path). A determinism harness re-runs a held-out sample of logged cases, compares each re-run against the original at every tier, and reports the strictest tier the agent satisfies above a confidence threshold. A gate maps the measured tier to a release decision: regulated decisions are admitted only at the decision-determinism tier or stricter, advisory or internal uses may ship at weaker tiers, and the required validation-sample size and monitoring frequency scale inversely with the tier so a weaker agent must clear a larger sample and a tighter drift watch. The measured tier is recorded as a re-certifiable assurance artifact and re-measured on every model or tool change, since either can silently drop the agent to a lower tier.
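
The harness and classifier described above can be sketched as follows. This is a minimal illustration, not the catalog's implementation: the `Run` record, the tier labels, and the 0.99 threshold are hypothetical. Each tier is checked independently, action determinism compares tool names only (the loosest possible argument tolerance), and the raw agreement rate stands in for a proper confidence bound.

```python
from dataclasses import dataclass

# Ladder in the order the text lists it, strictest first.
LADDER = ("trace", "action", "decision")


@dataclass
class Run:
    tool_calls: list  # [(tool_name, args_dict), ...] in call order
    decision: str     # final outcome, e.g. "approve" or "decline"


def matches(original: Run, rerun: Run) -> dict:
    """Compare one re-run against its original at every tier."""
    original_names = [name for name, _ in original.tool_calls]
    rerun_names = [name for name, _ in rerun.tool_calls]
    return {
        "trace": original.tool_calls == rerun.tool_calls,  # same tools, same arguments
        "action": original_names == rerun_names,           # same tools, arguments free
        "decision": original.decision == rerun.decision,   # same outcome, any path
    }


def tier_rates(pairs: list) -> dict:
    """Fraction of (original, rerun) pairs that agree at each tier."""
    if not pairs:
        raise ValueError("need at least one paired re-run")
    counts = dict.fromkeys(LADDER, 0)
    for original, rerun in pairs:
        for tier, agreed in matches(original, rerun).items():
            counts[tier] += agreed
    return {tier: counts[tier] / len(pairs) for tier in LADDER}


def strictest_tier(rates: dict, threshold: float = 0.99):
    """First tier in ladder order whose agreement rate clears the threshold."""
    for tier in LADDER:
        if rates[tier] >= threshold:
            return tier
    return None


def admit_regulated(rates: dict, threshold: float = 0.99) -> bool:
    """Regulated use requires decision determinism, whatever else is satisfied."""
    return rates["decision"] >= threshold


# Three paired re-runs: identical, reordered tools, and drifting arguments.
base = Run([("fetch_credit", {"id": 1}), ("score", {"x": 0.5})], "approve")
reordered = Run([("score", {"x": 0.5}), ("fetch_credit", {"id": 1})], "approve")
drifted = Run([("fetch_credit", {"id": 1}), ("score", {"x": 0.7})], "approve")
rates = tier_rates([(base, base), (base, reordered), (base, drifted)])
```

In this toy sample the decision never changes while the trace and the tool order do, so the agent lands at the decision tier, mirroring the loan agent in the example.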

What it gives you

A regulator who replays a logged regulated decision can expect the same conclusion, up to the measured decision-flip rate, because only decision-determinism agents were admitted for that use class.
Assurance effort is spent where reproducibility is weakest: a low-tier agent automatically draws a larger validation sample and tighter monitoring instead of a uniform bar.
A model or tool change that quietly lowers reproducibility is caught at re-measurement before it reaches a regulated decision.

What it costs you

Paired re-runs to measure the tier cost compute and must be repeated on every model or tool change, adding a standing assurance bill.
Forcing decision determinism can require lowering sampling temperature or serialising tool calls, trading answer quality for reproducibility.
A tier measured on a held-out sample can overstate reproducibility if production inputs drift away from the sampled distribution.

What this pattern forbids. An agent that has not been measured at the decision-determinism tier must not be admitted for a regulated decision, and a recorded tier may not stand without re-measurement once the model or any tool the agent calls has changed.
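
One way to enforce the prohibition above is to bind the recorded tier to a fingerprint of the model and tool versions it was measured under, so any change invalidates it. The sketch below is an assumption about how such an artifact might look: `TierArtifact`, `config_fingerprint`, and the model and tool identifiers are hypothetical names, and the pattern itself does not prescribe a hashing scheme.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


def config_fingerprint(model_id: str, tool_versions: dict) -> str:
    """Stable hash of the model id and every tool version the agent calls."""
    payload = json.dumps({"model": model_id, "tools": tool_versions}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class TierArtifact:
    tiers_satisfied: frozenset  # e.g. frozenset({"decision"})
    fingerprint: str            # configuration the tiers were measured under


def may_admit_regulated(
    artifact: Optional[TierArtifact], model_id: str, tool_versions: dict
) -> bool:
    """Admit only a measured, still-current, decision-deterministic agent."""
    if artifact is None:
        return False  # never measured
    if artifact.fingerprint != config_fingerprint(model_id, tool_versions):
        return False  # model or a tool changed since measurement
    return "decision" in artifact.tiers_satisfied


# Hypothetical identifiers, for illustration only.
tools = {"credit_lookup": "1.2.0"}
artifact = TierArtifact(frozenset({"decision"}), config_fingerprint("model-a", tools))
still_valid = may_admit_regulated(artifact, "model-a", tools)
after_tool_upgrade = may_admit_regulated(artifact, "model-a", {"credit_lookup": "1.3.0"})
```

Checking the fingerprint at admission time, not only at release time, means a tool upgrade deployed outside the normal release path still blocks regulated use until the tier is re-measured.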

The smaller patterns that complete this one:

uses Provenance Ledger ★★: Log every agent decision and state change with enough metadata to explain or reverse it later.

And the patterns that stand alongside it, or against it:

complements Journaled LLM Call ★: Record the output of every non-deterministic step on first execution and replay that recorded value during crash-recovery instead of re-invoking the model.
complements Replay / Time-Travel ★★: Re-run a past agent trace from any step with modified inputs/prompts/tools to debug or branch.
complements Risk-Tiered Action Autonomy ★: Set an agent's permitted action class by the financial materiality of the action, letting it read and draft freely while requiring a different human principal to release material postings, payments, or filings.
alternative-to Confident Inconsistency ✕ (anti-pattern): In a regulated workflow the same query produces materially different outputs at different times, each looking correct and passing review, so the variance stays invisible unless outputs are deliberately re-run and compared across time.
complements Replay Divergence ✕ (anti-pattern): Treating an append-only event log whose consumers are LLMs as deterministically replayable, so that replaying it under a changed model or prompt reconstructs different downstream events than the original run.

Provenance

Source: patterns/determinism-tier-replay-gate.md on GitHub · commit ad426c4
Added to catalog: 2026-06-14
Last updated: 2026-06-14
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.