Observability Fail-Open
also known as Forensic Blind Spot, Fail-Open Telemetry, False-Normal Monitoring
Anti-pattern: build an agent's monitoring and diagnostic tools to fail open, so when telemetry is blocked, denied, or missing they return a default healthy reading and operators see green while the system is broken.
Context
An agent runs in production under continuous monitoring — health checks, telemetry probes, diagnostic tool calls, anomaly scorers — that report whether it is operating normally. These signals gate alerts, autoscaling, rollback, and on-call response. Like any tool the agent or platform calls, the monitoring tools can themselves fail: a probe is sandboxed away, a permission is denied, an endpoint is unreachable, or a log sink drops writes.
Problem
When a monitoring or diagnostic tool cannot run, the system must decide what its absence means, and the convenient default is to treat no signal as a good signal. A denied or blocked probe returns an empty or default response that downstream logic reads as healthy, so the dashboard stays green precisely when observability is most degraded. The failure is doubly hidden: the underlying problem produces no alert, and the loss of the monitoring capability itself produces no alert. Operators and the agent both act on a false-normal reading, an incident runs unwatched, and a later forensic investigation finds that the diagnostic tools were silently denied and reported nothing wrong rather than reporting that they could not see.
Forces
- Failing open keeps the agent running when a probe breaks, which is why monitoring is so often built that way.
- Absence of a signal is ambiguous — it can mean healthy or it can mean the signal was lost — and the cheap default conflates the two.
- A monitor that fails closed risks halting the system or paging on its own outage, so teams bias toward failing open.
- Loss of observability is itself rarely monitored, so a blocked diagnostic tool leaves no trace that it went blind.
Example
A platform team gates rollback on an anomaly-scorer tool the agent calls each minute. A permissions change quietly blocks the scorer, and instead of erroring it returns an empty result that the rollback logic treats as no anomalies. For two hours a regression ships traffic errors while every dashboard reads green, and the postmortem discovers the scorer had been denied the whole time and never said so.
Solution
Therefore:
Treat the monitoring path as something that can fail and must announce its own failure. Distinguish three states explicitly — healthy, unhealthy, and unknown — and never collapse unknown into healthy: a denied, blocked, timed-out, or missing probe yields unknown, which is itself an alertable, degraded condition. Monitor the monitors by emitting a heartbeat for every diagnostic capability, so the loss of a probe pages on-call just as a failing service would. Where a control depends on a signal, fail safe on its absence — hold autoscaling, freeze irreversible actions, or escalate rather than proceed on a false-normal. In forensic review, record which diagnostics were available and which were denied, so a silent blind spot cannot be mistaken for a clean bill of health.
What this pattern forbids. A missing, denied, or unreachable monitoring signal must not default to healthy; diagnostic tools cannot collapse could-not-observe into no-problem-detected, and loss of a monitoring capability must itself raise an alert before the system is treated as well.
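The example above (an anomaly scorer silently denied, its empty result read as "no anomalies") and the solution's three-state model can be shown side by side in code. The following is an added example written for this entry, not code from the pattern catalogue or any project it describes; `HealthState`, `classify`, `Monitor` and the two gate functions are hypothetical names, and the scenario in `demo()` is constructed. It maps every way a probe can fail (denied, timed out, empty or stale) to `UNKNOWN`, pages on `UNKNOWN` the same way it pages on `UNHEALTHY`, holds the rollback decision instead of proceeding on a false-normal, and keeps a log of which diagnostics were available or denied for later forensic review.

```python
"""Tri-state diagnostics for an agent: a blocked or empty probe is UNKNOWN,
never HEALTHY. Added example for this pattern entry, not catalogue code;
HealthState, classify, Monitor and the gate functions are hypothetical names."""
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Optional


class HealthState(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    UNKNOWN = "unknown"  # denied, blocked, timed out or missing: alertable


class ProbeDenied(Exception):
    """A permission change or sandbox blocked the probe from running."""


@dataclass
class Observation:
    probe: str
    state: HealthState
    reason: str
    at: float  # seconds; passed in by the caller so the example is deterministic


def fail_open_gate(raw: Optional[list]) -> str:
    """The anti-pattern: None (tool blocked) reads the same as [] (no anomalies)."""
    return "rollback" if raw else "proceed"


def classify(probe: str, run: Callable[[], Optional[List[str]]], now: float) -> Observation:
    """Run one diagnostic and map every way it can fail to UNKNOWN."""
    try:
        anomalies = run()
    except ProbeDenied as e:
        return Observation(probe, HealthState.UNKNOWN, f"denied: {e}", now)
    except TimeoutError:
        return Observation(probe, HealthState.UNKNOWN, "timed out", now)
    if anomalies is None:
        # A default/empty response means "could not observe", not "nothing wrong".
        return Observation(probe, HealthState.UNKNOWN, "empty response", now)
    if anomalies:
        return Observation(probe, HealthState.UNHEALTHY, f"{len(anomalies)} anomalies", now)
    return Observation(probe, HealthState.HEALTHY, "ok", now)


@dataclass
class Monitor:
    """Monitor of monitors: each probe must heartbeat within max_age seconds."""
    max_age: float
    last: Dict[str, Observation] = field(default_factory=dict)
    forensic_log: List[Observation] = field(default_factory=list)

    def record(self, obs: Observation) -> None:
        self.last[obs.probe] = obs
        self.forensic_log.append(obs)  # which diagnostics were available or denied

    def status(self, now: float) -> Dict[str, HealthState]:
        """State per probe; a stale heartbeat is UNKNOWN, never HEALTHY."""
        return {name: (HealthState.UNKNOWN if now - obs.at > self.max_age else obs.state)
                for name, obs in self.last.items()}

    def alerts(self, now: float) -> List[str]:
        """UNKNOWN pages on-call just as UNHEALTHY does."""
        return [f"{n}: {s.value}" for n, s in self.status(now).items()
                if s is not HealthState.HEALTHY]


def fail_safe_gate(states: Dict[str, HealthState]) -> str:
    """Only an all-HEALTHY picture lets an irreversible action proceed."""
    if any(s is HealthState.UNHEALTHY for s in states.values()):
        return "rollback"
    if any(s is HealthState.UNKNOWN for s in states.values()):
        return "hold-and-escalate"
    return "proceed"


def demo() -> dict:
    """Constructed scenario: the scorer is denied after one healthy minute."""
    def scorer_ok(): return []
    def scorer_denied(): raise ProbeDenied("anomaly-scorer permission revoked")
    def latency_ok(): return []

    m = Monitor(max_age=90.0)
    m.record(classify("anomaly-scorer", scorer_ok, now=0.0))
    m.record(classify("latency", latency_ok, now=0.0))
    before = fail_safe_gate(m.status(now=5.0))
    # A permissions change quietly blocks the scorer on the next minute.
    m.record(classify("anomaly-scorer", scorer_denied, now=60.0))
    after = fail_safe_gate(m.status(now=65.0))
    # Later the latency probe simply stops reporting; its heartbeat goes stale.
    late = fail_safe_gate(m.status(now=200.0))
    return {"before": before, "after": after, "late": late,
            "alerts": m.alerts(65.0),
            "fail_open_says": fail_open_gate(None),
            "denied_in_log": [o.probe for o in m.forensic_log
                              if o.state is HealthState.UNKNOWN]}


if __name__ == "__main__":
    print(demo())
```

The contrast is in the two gate functions: `fail_open_gate(None)` returns "proceed", which is exactly the false-normal in the example above, while `fail_safe_gate` returns "hold-and-escalate" the moment any probe is `UNKNOWN`, and the `Monitor.alerts` call shows the denied scorer paging on its own outage rather than disappearing.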
The patterns that counter or replace it —
- complements: Agent Confession as Forensics — Anti-pattern: after an agent-caused incident, the team treats the agent's confabulated self-narrative as the forensic record and root cause, even though the self-report is generated rather than remembered and can be flatly wrong.
- complements: Errors Swept Under the Rug — Anti-pattern: scrub failed actions, stack traces, and error observations from the agent's own context so the trace looks clean, leaving the model with no evidence of what did not work.
- complements: Adversary-Indistinguishability Blind Spot — Anti-pattern: rely on behavioral-anomaly detection calibrated to irregular human behaviour, so an autonomous adversary acting with legitimate credentials, standard protocols, and superhuman consistency is less anomalous than a human and slips past unseen.
- complements: Agent Output Alert Fatigue — Anti-pattern: an agent emits high-volume, low-precision findings that progressively desensitise its human reviewers until they mute it, so even its correct findings stop landing and the human-oversight control silently disappears.