Confident Inconsistency
also known as Cross-Time Output Drift, Unmeasured Non-Determinism
Anti-pattern: in a regulated workflow the same query produces materially different outputs at different times, each looking correct and passing review, so the variance stays invisible unless outputs are deliberately re-run and compared across time.
Context
An agent produces outputs that feed regulated or high-stakes decisions — a legal analysis, a compliance determination, a risk assessment — where consistency is part of correctness. The model is non-deterministic: even under settings expected to be deterministic, the same input can yield different outputs across runs. Each output is reviewed once, on its own, and if it looks correct it is accepted and actioned.
Problem
Because each individual output looks correct and passes its single review, the fact that the same query would have produced a materially different answer at another time is never seen. The inconsistency generates no error signal — nothing is malformed, nothing throws — so it is invisible unless the organisation deliberately re-runs the query and compares outputs across time. In a regulated setting this means materially different determinations are made for equivalent inputs, each defensible in isolation, with the variance surfacing only under a deliberate consistency audit that single-run review never performs.
Forces
- LLM outputs vary across runs even under settings expected to be deterministic, so the same query is not guaranteed the same answer at another time.
- Each output is reviewed in isolation and looks correct, so single-run review cannot detect that another run would differ.
- The inconsistency produces no error signal, so nothing alerts on it the way a malformed or failing output would.
- Detecting it requires deliberately re-running and comparing outputs across time, which standard review does not do.
Example
A compliance team runs a contract clause through their agent on a Monday and gets a compliant determination. An identical run three weeks later returns non-compliant. Each determination was reviewed and signed off on its own day, and nobody compared the two, because nothing errored and both looked sound. Only a later audit that re-ran a batch of past queries revealed that the same inputs were getting different regulated answers over time.
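The kind of audit that surfaced the drift in this example can be sketched as a pass over a decision log: group records by input and flag any input that received more than one distinct determination. This is a minimal illustration, not the catalog's method. The `LoggedDecision` record, its field names, the normalisation rule in `input_key`, and the sample clauses are all made up for the sketch.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedDecision:
    # Hypothetical log record; field names are illustrative only.
    query: str          # the input text sent to the agent
    decided_on: str     # ISO date of the run
    determination: str  # e.g. "compliant" or "non-compliant"


def input_key(query: str) -> str:
    """Hash a case- and whitespace-normalised query so equivalent inputs collide.

    Assumption: this normalisation is what "equivalent input" means here.
    A real workflow would need its own, reviewed definition.
    """
    normalised = " ".join(query.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def find_drift(log: list[LoggedDecision]) -> dict[str, list[LoggedDecision]]:
    """Return only the input groups that received more than one distinct determination."""
    groups: dict[str, list[LoggedDecision]] = defaultdict(list)
    for record in log:
        groups[input_key(record.query)].append(record)
    return {
        key: sorted(records, key=lambda r: r.decided_on)
        for key, records in groups.items()
        if len({r.determination for r in records}) > 1
    }


# Made-up records mirroring the example: same clause, two dates, two answers.
log = [
    LoggedDecision("Clause 7.2: supplier may sub-process data.", "2024-03-04", "compliant"),
    LoggedDecision("Clause 7.2:  supplier may sub-process data.", "2024-03-25", "non-compliant"),
    LoggedDecision("Clause 9.1: term is 12 months.", "2024-03-04", "compliant"),
]
drift = find_drift(log)
for records in drift.values():
    print([(r.decided_on, r.determination) for r in records])
```

Comparing determinations by exact string equality is the simplest possible choice. Deciding what counts as a material difference between two free-text analyses would need a domain-specific comparison that this sketch does not attempt.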
Solution
Therefore:
Make consistency a measured property, not an assumption. Re-run identical inputs and compare the outputs across time to quantify how much the same query varies. Classify the agent into a reproducibility tier from that measurement, and require the strict tier for regulated decisions. Where determinism matters, pin it (fixed decoding, cached or replayed outputs for equivalent inputs) so the same input yields the same determination. Treat a material difference between two answers to the same query as a defect to investigate, even when each answer passes its own review, and audit consistency on a schedule rather than trusting that one good output implies a stable one. The control is cross-time comparison and a reproducibility requirement, not single-run inspection.
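One way the measurement and tiering might look in code is sketched below. The agreement metric (share of runs matching the most common output), the tier names, and the 0.9 threshold are assumptions chosen for illustration; the pattern itself does not prescribe them. The `flaky_agent` stub stands in for a real model call.

```python
import itertools
from collections import Counter
from typing import Callable


def agreement_rate(outputs: list[str]) -> float:
    """Fraction of runs that match the most common output (1.0 means every run agreed)."""
    if not outputs:
        raise ValueError("need at least one output")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


def classify_tier(rate: float) -> str:
    """Map an agreement rate to a tier. Tier names and thresholds are assumptions."""
    if rate == 1.0:
        return "strict"
    if rate >= 0.9:
        return "tolerant"
    return "unstable"


def measure(agent: Callable[[str], str], query: str, runs: int) -> tuple[float, str]:
    """Re-run one identical input `runs` times and report agreement rate and tier."""
    outputs = [agent(query) for _ in range(runs)]
    rate = agreement_rate(outputs)
    return rate, classify_tier(rate)


def allowed_for_regulated_use(tier: str) -> bool:
    """Gate: only the strict tier may feed regulated decisions."""
    return tier == "strict"


# Stub agent that disagrees with itself once in every ten calls (illustrative only).
_answers = itertools.cycle(["compliant"] * 9 + ["non-compliant"])


def flaky_agent(query: str) -> str:
    return next(_answers)


rate, tier = measure(flaky_agent, "Clause 7.2: supplier may sub-process data.", runs=10)
print(rate, tier, allowed_for_regulated_use(tier))
```

The sketch runs the query back to back for brevity. To measure cross-time drift as the solution describes, the same calls would be scheduled over days or weeks and the stored outputs compared afterwards.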
What this pattern forbids. Single-run review must not be treated as sufficient for a regulated output, and materially different answers to the same query must not be accepted merely because each passed its own review. Instead, consistency is measured by re-running identical inputs and comparing across time, a reproducibility tier appropriate to the stakes is required, and any material difference is treated as a defect.
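As a rough sketch of how pinning and the defect rule could work together, the following replay store returns the determination already on record for an equivalent input and raises when a scheduled re-run disagrees with it. `ReplayStore`, `ConsistencyDefect`, the whitespace-only normalisation, and the `drifting_agent` stub are hypothetical names and choices introduced for this illustration. A production version would need durable storage and a reviewed definition of input equivalence.

```python
import hashlib
from typing import Callable


class ConsistencyDefect(Exception):
    """Raised when a fresh run disagrees with the determination already on record."""


class ReplayStore:
    """In-memory stand-in for a durable store of pinned determinations (hypothetical)."""

    def __init__(self) -> None:
        self._pinned: dict[str, str] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Assumption: whitespace normalisation defines "equivalent inputs".
        return hashlib.sha256(" ".join(query.split()).encode("utf-8")).hexdigest()

    def decide(self, agent: Callable[[str], str], query: str) -> str:
        """Replay the pinned determination if one exists; otherwise run the agent and pin it."""
        key = self._key(query)
        if key not in self._pinned:
            self._pinned[key] = agent(query)
        return self._pinned[key]

    def audit(self, agent: Callable[[str], str], query: str) -> None:
        """Re-run the agent and raise if it now disagrees with the pinned answer."""
        pinned = self._pinned[self._key(query)]
        fresh = agent(query)
        if fresh != pinned:
            raise ConsistencyDefect(f"pinned={pinned!r} fresh={fresh!r}")


# Stub agent whose answer changes between calls (illustrative only).
_answers = iter(["compliant", "non-compliant"])


def drifting_agent(query: str) -> str:
    return next(_answers)


store = ReplayStore()
first = store.decide(drifting_agent, "Clause 7.2")
second = store.decide(drifting_agent, "Clause  7.2")  # replayed; the agent is not called

try:
    store.audit(drifting_agent, "Clause 7.2")
    defect_found = False
except ConsistencyDefect:
    defect_found = True

print(first, second, defect_found)
```

Replay keeps equivalent inputs on the same determination, while the separate `audit` path keeps the underlying variance visible instead of hiding it behind the cache.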
The patterns that counter or replace it:
- alternative-to Self-Consistency ★★: Sample the same question multiple times at non-zero temperature and aggregate by majority or judge to mitigate hallucination.
- alternative-to Determinism-Tiered Replay Gate ·: Classify an agent into a reproducibility tier by re-running identical inputs, require the strictest decision-determinism tier for regulated decisions, and gate deployment and validation-sample size on the measured tier.
- complements False Confidence Syndrome ✕: Anti-pattern: the model produces incorrect answers with the same high confidence as correct ones, failing to vary its expressed certainty with its actual reliability (Oxford-documented for constraint-heavy prompts).
- complements Confidence Reporting ★: Surface the agent's uncertainty about its answer alongside the answer itself.