Scaffold Ablation on Model Upgrade

also known as Harness Assumption Review, Scaffold Decay Review

On each model upgrade, treat every harness component as an encoded assumption about a model weakness and ablate the components the new model no longer needs, gated by evals.

Context

A team runs an agent behind a harness that has accreted over several model generations: retry wrappers, decomposition scaffolds, format-coercion steps, guardrails, sprint or planning constructs. Each was added to compensate for something a past model could not do reliably. A stronger model arrives, and the harness is carried over wholesale because it 'works'.

Problem

Every harness component encodes an assumption about what the model cannot do on its own, and those assumptions expire silently as models improve. Carried-over scaffolding that the new model no longer needs is not free: it is dead complexity to maintain, it adds cost and latency, and at worst it actively suppresses the stronger model's capability by forcing it down a path built for a weaker one. Because nothing fails loudly when an assumption expires, the harness only grows; no event prompts anyone to remove a component, so workarounds outlive the limitation that justified them.

Forces

Carrying the harness over is safe in the short term but accumulates capability-suppressing debt over generations.
Removing a component risks a regression if the assumption has not fully expired.
Whether an assumption still holds is only knowable against an eval, which the team must own.
Over-scaffolding and under-scaffolding both degrade a stronger model; the right amount shifts every release.

Example

A coding agent's harness includes a sprint construct that forces the model to break work into small, separately planned chunks. It was added because an earlier model lost the thread on long tasks. A stronger model ships. Instead of carrying the construct over, the team labels it with its assumption ('the model cannot hold a long plan'), removes it, and runs the eval suite. The evals hold or improve, so the construct is deleted for good. The agent now plans long tasks directly, and the harness is lighter by one component that no longer bore any load.

Diagram

flowchart TD
    Up[Model upgrade] --> Loop[For each harness component]
    Loop --> A[State the encoded model-limitation assumption]
    A --> Ab[Ablate: remove component]
    Ab --> E{Eval suite holds?}
    E -->|yes, assumption expired| R[Remove component]
    E -->|no, regression| K[Keep component]
    R --> Loop
    K --> Loop

Solution

Therefore:

Make each harness component carry the assumption it encodes ('the model cannot keep a long plan straight', 'the model will not emit valid JSON'). On a model upgrade, walk the components and stress-test each assumption against the new model: temporarily remove the component and run the eval suite. If the eval holds, the assumption has expired and the component comes out; if it regresses, the assumption survives and the component stays. Anthropic demonstrates the move concretely by deleting a sprint construct on an upgrade once the model could plan without it. The eval suite is the gate; the corresponding anti-pattern is keeping stale workaround scaffolding that now constrains the stronger model. Compose with eval-as-contract for the gate and with dynamic-scaffolding for components that should be conditional rather than removed.
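The walk described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: `Component`, `ablate_on_upgrade`, the `run_evals` hook, and the `tolerance` parameter are all hypothetical names, and the toy `fake_evals` function stands in for a real eval suite run against the new model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class Component:
    name: str
    assumption: str  # the model limitation this component compensates for


# Hypothetical eval hook: given the set of enabled component names,
# run the eval suite against the new model and return an aggregate score.
EvalFn = Callable[[FrozenSet[str]], float]


def ablate_on_upgrade(
    components: List[Component],
    run_evals: EvalFn,
    tolerance: float = 0.0,
) -> Dict[str, str]:
    """Ablate each component in turn; remove it if the evals hold."""
    enabled = frozenset(c.name for c in components)
    baseline = run_evals(enabled)
    decisions: Dict[str, str] = {}
    for c in components:
        trial = enabled - {c.name}
        score = run_evals(trial)
        if score >= baseline - tolerance:
            # Evals hold: the assumption has expired, so the removal sticks.
            enabled = trial
            baseline = max(baseline, score)  # never lower the bar
            decisions[c.name] = "remove"
        else:
            # Regression: the assumption survives, so the component stays.
            decisions[c.name] = "keep"
    return decisions


def fake_evals(enabled: FrozenSet[str]) -> float:
    # Toy stand-in (assumption): the new model still needs JSON coercion,
    # and the sprint construct now slightly hurts it.
    score = 0.9
    if "json_coercion" not in enabled:
        score -= 0.2
    if "sprint_construct" in enabled:
        score -= 0.05
    return score


components = [
    Component("sprint_construct", "the model cannot keep a long plan straight"),
    Component("json_coercion", "the model will not emit valid JSON"),
]
decisions = ablate_on_upgrade(components, fake_evals)
print(decisions)
```

In this sketch removals are cumulative: once a component is ablated successfully, later components are tested against the already lighter harness. Testing each component in isolation against the full harness is an equally reasonable choice; which one fits depends on how strongly the components interact.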

What this pattern forbids. A harness component may not survive a model upgrade on inertia; it must be retained only against an eval that shows its underlying model-limitation assumption still holds for the new model.
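One way to make this rule checkable is to keep a retention record per component and audit the records at upgrade time. The sketch below is an assumption-laden illustration: `RetentionRecord`, `inertia_violations`, and the model identifiers `model-v2` and `model-v3` are hypothetical, and a real harness would populate the records from actual ablation runs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RetentionRecord:
    component: str
    assumption: str
    # Model id the ablation eval was last run against (None if never run).
    verified_on_model: Optional[str]
    # True if removing the component regressed the evals on that model.
    regressed_without: bool


def inertia_violations(
    records: List[RetentionRecord], current_model: str
) -> List[str]:
    """Return components retained without eval evidence for the current model."""
    return [
        r.component
        for r in records
        if r.verified_on_model != current_model or not r.regressed_without
    ]


records = [
    RetentionRecord(
        "json_coercion", "the model will not emit valid JSON", "model-v3", True
    ),
    RetentionRecord(
        "retry_wrapper", "the model fails transiently on tool calls", "model-v2", True
    ),
]
violations = inertia_violations(records, current_model="model-v3")
print(violations)
```

Here `retry_wrapper` is flagged because its evidence was gathered against the previous model, which is exactly the inertia case the rule forbids.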

The smaller patterns that complete this one —

uses Eval as Contract ★★: Treat the eval suite as the contract the agent must satisfy; releases ship only if evals pass.

And the patterns that stand alongside it, or against it —

alternative-to Dynamic Scaffolding ★: Inject task-specific scaffolding (examples, hints, schemas) into the prompt only when the task type warrants it.
complements Enforced Advisory Disclaimer ★: Append a non-suppressible advisory framing every high-risk regulated answer as information rather than professional advice, attached outside the model's discretion so it survives pushback and model updates.