Behavior-Pinning Test Before Agent Edit
also known as Characterization Test Before Refactor, Golden-Pin Regression Gate, Pin-Before-Edit
Before an agent edits agent-touchable code, capture its current behaviour as golden characterization tests and run them as a regression gate. Load-bearing values are computed deterministically and asserted exactly; only the prose is left to the model.
Context
A coding or skill agent is about to refactor or extend a body of legacy code whose existing behaviour encodes business rules that nobody has written down. The rules live in the code itself: a pricing tier boundary, a rounding convention, an aggregation order. The agent reads the code, infers what it can, and rewrites it, but the inferred intent is not the same as the actual behaviour, and a clean refactor can quietly change a hidden invariant that no human notices until production.
Problem
Letting an agent edit undocumented code is risky because there is no recorded statement of what the code currently does, so a refactor has nothing to be checked against. Hand-written tests after the fact tend to encode what the model thinks the code should do, not what it did, and a suite written by the same model that wrote the change inherits the model's blind spots. The team needs a baseline that fixes the existing behaviour exactly, separates the values that must not move from the explanations that may, and fails the moment an edit drifts.
Forces
- The current behaviour is the only available specification, yet it is uncodified and an edit can silently overwrite it.
- A numeric result computed by a deterministic function is checkable to the digit, while a model-written explanation of that result is inherently variable.
- Tests authored after the edit risk codifying the edit's behaviour rather than the original; the baseline must be frozen before the agent touches anything.
- Pinning every observable detail is brittle and slows the team; pinning too little lets a load-bearing rule slip through unguarded.
Example
A team asks a coding agent to refactor a tangled invoice module that nobody fully understands. Before the agent starts, they run the module on a hundred saved invoices and freeze the resulting totals and tax figures as golden tests. The agent rewrites the module; on finish, the tests run automatically. One invoice now totals a cent less because the agent reordered a rounding step, so the gate fails and the change is rejected before it ever reaches production.
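The one-cent drift in this example can be reproduced with a small sketch. The line amounts, the 7.5% tax rate, and both function names are hypothetical, chosen only to show how moving a rounding step changes the result; they are not taken from any real invoice module.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")
TAX_RATE = Decimal("0.075")  # hypothetical rate, for illustration only


def round_cents(amount):
    """Round to whole cents, halves rounding up."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def tax_rounded_per_line(lines):
    """Assumed legacy behaviour: round the tax on each line, then sum."""
    return sum((round_cents(line * TAX_RATE) for line in lines), Decimal("0"))


def tax_rounded_on_total(lines):
    """Assumed refactored behaviour: sum the lines, then round once."""
    return round_cents(sum(lines, Decimal("0")) * TAX_RATE)


lines = [Decimal("10.20")] * 3

before = tax_rounded_per_line(lines)   # 0.765 rounds to 0.77 on each line
after = tax_rounded_on_total(lines)    # 2.295 rounds to 2.30 once
print(before, after, before - after)
```

Both versions look equally reasonable in review, which is why only a value pinned before the edit can reveal that the second one charges a cent less.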
Solution
Therefore:
Run the legacy code against representative inputs and capture its outputs as golden fixtures while the code is still untouched, so the baseline reflects what the code does rather than what anyone believes it should do. Partition each captured output: the load-bearing values, such as totals, tier boundaries, and rounding, are produced by deterministic components and asserted for exact equality, while any model-written narration that explains or annotates the result is compared loosely or excluded from the gate. Wire the suite into the agent's loop so that finishing an edit triggers the characterization tests automatically, detected from policy keywords rather than left to the agent's discretion, and treat any exact-match failure as a blocked change. Once the refactor is proven behaviour-preserving, the golden values may be updated deliberately by a human when a rule is meant to change.
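The capture and partition steps might look like the following minimal sketch. Everything named here is an assumption made for illustration: the field names `total`, `tax`, and `note`, the JSON fixture layout, and the two stand-in invoice functions, which work in integer cents so that equality is exact.

```python
import json
import tempfile
from pathlib import Path

# Assumption: these fields are load-bearing; "note" is model-written narration.
PINNED_FIELDS = ("total", "tax")


def capture_golden(fn, inputs, path):
    """Run the untouched code on each input and freeze its outputs."""
    records = [{"input": item, "output": fn(item)} for item in inputs]
    Path(path).write_text(json.dumps(records, indent=2, sort_keys=True))


def check_against_golden(fn, path):
    """Return exact-match failures on pinned fields; narration is ignored."""
    failures = []
    for record in json.loads(Path(path).read_text()):
        actual = fn(record["input"])
        for field in PINNED_FIELDS:
            expected = record["output"][field]
            if actual[field] != expected:
                failures.append((field, expected, actual[field]))
    return failures


def legacy(invoice):
    """Hypothetical legacy code: 7.5% tax rounded half-up per line."""
    tax = sum((line * 75 + 500) // 1000 for line in invoice["lines"])
    return {"total": sum(invoice["lines"]) + tax, "tax": tax,
            "note": "Tax rounded per line."}


def refactored(invoice):
    """Hypothetical agent rewrite: tax rounded once on the subtotal."""
    subtotal = sum(invoice["lines"])
    tax = (subtotal * 75 + 500) // 1000
    return {"total": subtotal + tax, "tax": tax,
            "note": "Tax applied to the subtotal."}


golden_path = Path(tempfile.mkdtemp()) / "golden.json"
capture_golden(legacy, [{"lines": [1020, 1020, 1020]}], golden_path)

drift = check_against_golden(refactored, golden_path)
print(drift)
```

Here the rewrite changes both the wording of `note` and the pinned numbers. Only the numbers reach `drift`, so a rewrite that merely rephrased the note would pass while this one would be blocked.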
What this pattern forbids. An agent edit cannot be accepted until the pre-edit characterization suite passes; the deterministic load-bearing values must match the pinned baseline exactly, and changing a golden value requires a deliberate human update rather than a model rewrite.
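These prohibitions can be expressed as a small decision function that runs when the agent finishes an edit. This is a sketch under stated assumptions: `GateDecision`, `decide`, and the `human_signed_off` flag are hypothetical names, and how a pipeline detects that a golden file was modified or that a human approved it is left to the surrounding tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    accepted: bool
    reason: str


def decide(failures, golden_modified, human_signed_off=False):
    """Accept an edit only if pinned values hold and goldens were not rewritten by the model.

    failures: exact-match failures reported by the characterization suite.
    golden_modified: True if the edit touched any golden fixture.
    human_signed_off: True only when a person approved the golden change.
    """
    if golden_modified and not human_signed_off:
        return GateDecision(False, "golden values changed without human sign-off")
    if failures:
        return GateDecision(False, f"{len(failures)} pinned value(s) drifted from baseline")
    return GateDecision(True, "pinned values match the baseline")


print(decide(failures=[("tax", 231, 230)], golden_modified=False))
print(decide(failures=[], golden_modified=True))
print(decide(failures=[], golden_modified=True, human_signed_off=True))
```

Checking the golden-file rule first means an agent cannot make a failing suite pass by editing the fixtures it is being tested against.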
And the patterns that stand alongside it, or against it:
- complements: Silent Hypotheses in Generated Code ✕ (anti-pattern). Model-written code rests on an unstated runtime premise that passing tests and code review never surface, so the hidden assumption travels into production and fails there.
- complements: Eval as Contract ★★. Treat the eval suite as the contract the agent must satisfy; releases ship only if evals pass.
- alternative-to: Eval Harness ★★. Run a held-out dataset against agent versions to detect regressions and measure improvement.
- complements: Deterministic-LLM Sandwich ★. Bracket every LLM call with deterministic checks on both sides.