Determinism-Tiered Replay Gate
also known as Decision-Determinism Gate, Graded Reproducibility Gate, Replay-Tier Deployment Gate
Classify an agent into a reproducibility tier by re-running identical inputs, require at least the decision-determinism tier for regulated decisions, and gate deployment and validation-sample size on the measured tier.
Context
A tool-using agent makes decisions that an auditor or regulator may later re-examine, such as approving a transaction, scoring a credit application, or filing a report. Replay machinery already exists: inputs, prompts, model ids, and tool calls are captured so a past run can be re-executed. What is missing is a statement of how reproducible the agent actually is. Re-running the same inputs can yield an identical tool sequence, the same sequence with drifting arguments, or merely the same final decision by a different path, and nobody has measured which.
Problem
Replay being mechanically possible does not mean a re-run converges. Sampling temperature, tool-ordering races, clock and retrieval drift, and model-version changes all let two runs of identical inputs diverge, and the divergence may stop at the reasoning trace or reach the decision itself. Treating all agents as equally reproducible lets one whose final decision flips on re-run pass the same governance bar as one that is bit-for-bit stable, so a regulator who replays a logged case can get a different answer than the customer received, with no prior signal that this was possible.
Forces
- An auditor cares that the agent reaches the same conclusion on re-run, even when the internal reasoning path legitimately varies, so strict trace-level reproducibility is stronger than compliance strictly needs.
- Sampling and tool concurrency that raise answer quality also lower reproducibility, so the determinism tier trades against capability rather than being free.
- A less reproducible agent needs a larger validation sample and tighter monitoring to bound its decision-flip rate, so the assurance cost rises as the tier weakens.
- Measuring the tier requires many paired re-runs, which costs compute and must be repeated whenever the model or tools change.
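The link between reproducibility and validation-sample size in the forces above can be made concrete with a little arithmetic. The sketch below is an assumption, not the pattern's prescribed method: it uses the Wilson score interval to put an upper bound on an observed decision-flip rate, and a zero-flip calculation to estimate how many paired re-runs are needed to rule out a given flip rate. The function names `wilson_upper` and `pairs_needed_zero_flips` are hypothetical.

```python
import math


def wilson_upper(flips: int, n: int, z: float = 1.96) -> float:
    """Upper end of the Wilson score interval for a flip rate of flips/n.

    z = 1.96 corresponds to a two-sided 95 percent interval (an assumption;
    pick whatever confidence level the governance policy requires).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = flips / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + margin) / denom


def pairs_needed_zero_flips(max_flip_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that seeing zero flips in n paired re-runs rules out
    a true flip rate of max_flip_rate or higher at the given confidence.

    Solves (1 - max_flip_rate) ** n <= 1 - confidence for n.
    """
    if not 0 < max_flip_rate < 1:
        raise ValueError("max_flip_rate must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_flip_rate))


if __name__ == "__main__":
    # Illustrative figures only: 12 flips observed in 2000 paired re-runs.
    print(f"upper bound on flip rate: {wilson_upper(12, 2000):.4f}")
    # Pairs needed to rule out a 1 percent and a 0.1 percent flip rate.
    print(pairs_needed_zero_flips(0.01), pairs_needed_zero_flips(0.001))
```

With these illustrative figures the upper bound comes out a little over 1 percent, and tightening the tolerated flip rate from 1 percent to 0.1 percent raises the required number of clean pairs roughly tenfold, which is the assurance cost the forces describe.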
Example
A bank runs an agent that approves or declines small-business loans. Before launch, a determinism harness replays two thousand logged applications twice each and finds that the agent gives the same approve/decline outcome on 99.4 percent of re-runs but rarely the same tool sequence. It is classified at the decision-determinism tier, so it clears the gate for regulated decisions. A sister agent that summarises calls reaches action determinism but falls short of decision determinism, so it is released for internal drafting only, with a larger validation sample and weekly drift checks.
Solution
Therefore:
Define an ordered ladder of reproducibility tiers measured by paired re-runs on identical inputs: trace determinism (the same tool sequence and arguments), action determinism (the same tool sequence with arguments allowed to vary within tolerance), and decision determinism (the same final decision regardless of path). A determinism harness re-runs a held-out sample of logged cases, compares each re-run against the original at every tier, and reports the strictest tier the agent satisfies above a confidence threshold. A gate maps the measured tier to a release decision: regulated decisions are admitted only at the decision-determinism tier or stricter, advisory or internal uses may ship at weaker tiers, and the required validation-sample size and monitoring frequency grow as the tier weakens, so a less reproducible agent must clear a larger sample and a tighter drift watch. The measured tier is recorded as a re-certifiable assurance artifact and re-measured on every model or tool change, since either can silently drop the agent to a lower tier.
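A minimal sketch of the harness and gate described above, under stated assumptions: `Run`, `ToolCall`, `measure`, `strictest_tier`, and `gate` are hypothetical names, the "confidence threshold" is simplified to a plain agreement rate, and the argument tolerance is supplied by the caller as `args_close`. Each tier is measured independently on the same pairs, and the regulated gate checks the decision-determinism rate directly.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any]


@dataclass
class Run:
    calls: List[ToolCall]   # ordered tool calls made during the run
    decision: str           # final decision, e.g. "approve" or "decline"


ArgsClose = Callable[[Dict[str, Any], Dict[str, Any]], bool]
TIERS = ("trace", "action", "decision")  # listed strictest first


def same_trace(a: Run, b: Run) -> bool:
    # Same tool sequence and identical arguments.
    return a.calls == b.calls


def same_actions(a: Run, b: Run, args_close: ArgsClose) -> bool:
    # Same tool sequence; arguments may vary within the caller's tolerance.
    return len(a.calls) == len(b.calls) and all(
        x.name == y.name and args_close(x.args, y.args)
        for x, y in zip(a.calls, b.calls)
    )


def same_decision(a: Run, b: Run) -> bool:
    # Same final decision regardless of path.
    return a.decision == b.decision


def measure(pairs: List[Tuple[Run, Run]], args_close: ArgsClose) -> Dict[str, float]:
    """Agreement rate per tier over (original, re-run) pairs."""
    if not pairs:
        raise ValueError("need at least one paired re-run")
    checks = {
        "trace": same_trace,
        "action": lambda a, b: same_actions(a, b, args_close),
        "decision": same_decision,
    }
    return {t: sum(checks[t](o, r) for o, r in pairs) / len(pairs) for t in TIERS}


def strictest_tier(rates: Dict[str, float], threshold: float) -> Optional[str]:
    """Strictest tier whose agreement rate meets the threshold, if any."""
    for tier in TIERS:
        if rates[tier] >= threshold:
            return tier
    return None


def gate(rates: Dict[str, float], use: str, threshold: float = 0.99) -> bool:
    """Regulated uses need decision determinism; other uses may ship weaker."""
    if use == "regulated":
        return rates["decision"] >= threshold
    return True
```

A caller would feed `measure` the logged originals paired with their re-runs, record the result of `strictest_tier` as the assurance artifact, and consult `gate` before release. The default threshold of 0.99 is a placeholder, not a recommended value.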
What this pattern forbids. An agent that has not been measured at the decision-determinism tier must not be admitted for a regulated decision, and the recorded tier may not stand once the model or any tool the agent calls has changed without re-measurement.
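One way to enforce the second prohibition is to bind the recorded tier to a fingerprint of the model and tool versions it was measured against, so the record stops being valid the moment either changes. The sketch below is an assumption about how that could look; `TierCertificate`, `fingerprint`, and `admit_regulated` are hypothetical names, and the version strings are illustrative.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, Optional


def fingerprint(model_id: str, tool_versions: Dict[str, str]) -> str:
    """Stable hash of the model id and every tool version the agent calls."""
    payload = json.dumps({"model": model_id, "tools": tool_versions}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class TierCertificate:
    decision_deterministic: bool  # result of the last tier measurement
    fingerprint: str              # what the measurement was taken against


def admit_regulated(
    cert: Optional[TierCertificate],
    model_id: str,
    tool_versions: Dict[str, str],
) -> bool:
    """Admit only if a measurement exists, passed, and is not stale."""
    if cert is None or not cert.decision_deterministic:
        return False
    # A changed model or tool yields a different fingerprint, so the
    # old certificate no longer applies until the tier is re-measured.
    return cert.fingerprint == fingerprint(model_id, tool_versions)


if __name__ == "__main__":
    tools = {"credit_lookup": "1.4.0"}  # hypothetical tool and version
    cert = TierCertificate(True, fingerprint("model-a", tools))
    print(admit_regulated(cert, "model-a", tools))                       # still current
    print(admit_regulated(cert, "model-a", {"credit_lookup": "1.5.0"}))  # tool changed
```

Hashing a sorted JSON payload keeps the fingerprint independent of dictionary ordering. What counts as a "change" (for example, whether a prompt edit should also be fingerprinted) is a policy choice the sketch leaves open.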
The smaller patterns that complete this one:
- uses Provenance Ledger ★★: Log every agent decision and state change with enough metadata to explain or reverse it later.
And the patterns that stand alongside it, or against it:
- complements Journaled LLM Call ★: Record the output of every non-deterministic step on first execution and replay that recorded value during crash-recovery instead of re-invoking the model.
- complements Replay / Time-Travel ★★: Re-run a past agent trace from any step with modified inputs/prompts/tools to debug or branch.
- complements Risk-Tiered Action Autonomy ★: Set an agent's permitted action class by the financial materiality of the action, letting it read and draft freely while requiring a different human principal to release material postings, payments, or filings.