Symptom-Remediation Thrashing

also known as Masking-Fix Loop, Root-Cause-Suppressing Remediation

Anti-pattern: a stateless auto-remediation agent repeatedly applies symptom-level fixes that hit the target metric while masking the root cause and suppressing the page, so the underlying fault compounds across incidents into a larger outage.

Context

An auto-remediation agent watches production and responds to alerts by running fixes — restart the pod, scale out, re-roll the deployment — to bring a metric back into range. Each fix is evaluated by whether the metric recovers, and a recovered metric closes the incident. The agent is stateless across incidents: it does not carry what it did last time into the next alert.

Problem

A symptom-level fix can bring the metric back while leaving the real cause untouched — scaling a service masks a noisy neighbour, restarting a pod clears a leak that refills. Because the metric recovers, the incident closes and no page reaches the team that owns the root cause, so the fix both hides the problem and suppresses the signal that would get it fixed. With no memory across incidents and no cap on repeated remediation, the agent keeps applying the same masking fix while the underlying fault grows, until it fails harder and takes down more than the original symptom.

Forces

A fix that returns the metric to range looks successful, even when it only masks the cause.
Closing the incident on a recovered metric suppresses the page that would route the root cause to the right team.
A stateless agent cannot see that it has fixed the same symptom before, so it cannot tell masking from resolution.
Auto-remediation is valued for speed, and adding a cross-incident check or an escalate-after-N cap slows it.

Example

An auto-remediation agent gets a CPU alert on an API pod and scales the service out; the metric recovers and the incident closes. The real cause is a noisy neighbour on the node, which keeps thrashing unseen because no page reached the platform team. The agent scales out again on the next alert, and the one after, until two days later the node fails outright and takes three services down at once.

Diagram

flowchart TD AL[Alert] --> FX[Symptom fix: metric recovers] FX --> CL[Incident closed, page suppressed] CL --> RC[Root cause untouched] RC --> RE[Recurs; same fix re-applied] RE --> BG[Fault compounds into larger outage]

Solution

Therefore:

Treat a recovered metric as mitigation, not resolution, and design remediation to detect its own masking. Carry state across incidents so the agent can see it has applied the same fix to the same symptom before, and cap repeated remediation with an escalate-after-N-attempts rule that routes a recurring symptom to a human instead of re-applying the mask. Keep the page alive when a fix is a known mask rather than a root-cause resolution, so the owning team is still notified. Distinguish masking from resolution — for example by checking whether the underlying signal, not just the target metric, returned to health — before declaring the incident closed. The control is cross-incident memory plus an escalation cap, not a faster symptom fix.

What this pattern forbids. A recovered target metric must not by itself close an incident as resolved; remediation carries state across incidents, repeated fixes on the same symptom escalate to a human after a bounded number of attempts rather than looping, and a masking fix may not suppress the page to the root-cause owner.

The patterns that counter or replace it —

complementsNaive Retry Without Backoff✕— Anti-pattern: retry failed model or tool calls immediately, amplifying load on systems that are already failing.
complementsUnbounded Loop✕— Anti-pattern: run the agent loop without a step budget and let model self-termination decide.
alternative-toCircuit Breaker★★— Stop calling a failing dependency for a cooldown period after error rates exceed a threshold.
complementsComposable Termination Conditions★— Express agent stop criteria as small single-purpose conditions composed with AND/OR into one explicit termination contract instead of ad-hoc loop guards.
complementsPhantom Action Completion✕— Anti-pattern: the agent reports a side-effecting action as complete from its own narration, when the tool call silently failed or never ran and nothing checked that the effect occurred.
complementsProduction Failure Triage Loop★— Sort every production agent failure into a small fixed taxonomy and bind each class to a set remediation path, so fixes are dispatched mechanically and the monitor-to-fix loop stays fast enough to gate scaling.

Neighbourhood

Click any neighbour to follow the language. Scroll to zoom, drag to pan.

References

Provenance

Source: patterns/symptom-remediation-thrashing.md on GitHub · commit 7012173 · view history
Added to catalog: 2026-06-17
Last updated: 2026-06-17
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.