Observability Fail-Open
Anti-pattern: build an agent's monitoring and diagnostic tools to fail open, so when telemetry is blocked, denied, or missing they return a default healthy reading and operators see green while the system is broken.
Problem
When a monitoring or diagnostic tool cannot run, the system must decide what its absence means, and the convenient default is to treat no signal as a good signal. A denied or blocked probe returns an empty or default response that downstream logic reads as healthy, so the dashboard stays green precisely when observability is most degraded. The failure is doubly hidden: the underlying problem produces no alert, and the loss of the monitoring capability itself produces no alert. Operators and the agent both act on a false-normal reading, an incident runs unwatched, and a later forensic investigation finds that the diagnostic tools were silently denied and reported nothing wrong rather than reporting that they could not see.
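The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not code from any real monitoring stack: `read_error_rate`, `is_healthy_fail_open`, and the 5% threshold are hypothetical names and values chosen for illustration.

```python
from typing import Optional


def read_error_rate(probe_allowed: bool) -> Optional[float]:
    """Hypothetical probe: returns None when the probe is denied or blocked."""
    # Assumed for illustration: the service is actually failing 42% of requests.
    return 0.42 if probe_allowed else None


def is_healthy_fail_open(error_rate: Optional[float], threshold: float = 0.05) -> bool:
    """Anti-pattern: a missing reading is replaced by a default of 0.0."""
    rate = error_rate or 0.0  # "no signal" silently becomes "good signal"
    return rate <= threshold


# With the probe working, the broken service is reported as unhealthy.
print(is_healthy_fail_open(read_error_rate(probe_allowed=True)))
# With the probe denied, the same broken service is reported as healthy.
print(is_healthy_fail_open(read_error_rate(probe_allowed=False)))
```

The defect sits in the single line that substitutes a default for a missing value: nothing downstream can tell a measured zero from a reading that was never taken.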
Solution
Treat the monitoring path as something that can fail and must announce its own failure. Distinguish three states explicitly (healthy, unhealthy, and unknown) and never collapse unknown into healthy: a denied, blocked, timed-out, or missing probe yields unknown, which is itself an alertable, degraded condition. Monitor the monitors by emitting a heartbeat for every diagnostic capability, so the loss of a probe pages on-call just as a failing service would. Where a control depends on a signal, fail safe on its absence: hold autoscaling, freeze irreversible actions, or escalate rather than proceed on a false-normal reading. In forensic review, record which diagnostics were available and which were denied, so a silent blind spot cannot be mistaken for a clean bill of health.
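One way to sketch the three-state approach, assuming probes are plain callables that either return a numeric reading, return nothing, or raise when blocked. Every name here (`Health`, `ProbeResult`, `run_probe`, `should_alert`, `may_proceed`, `stale_probes`) is hypothetical and introduced only for illustration; a real system would wire these into its own alerting and control paths.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional


class Health(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    UNKNOWN = "unknown"  # the probe could not see; never treated as healthy


@dataclass(frozen=True)
class ProbeResult:
    probe: str
    health: Health
    reason: str = ""  # kept for forensic review: why the probe could not report


def run_probe(name: str, read: Callable[[], Optional[float]],
              threshold: float) -> ProbeResult:
    """Run one probe and map every failure to observe onto UNKNOWN."""
    try:
        value = read()
    except (PermissionError, TimeoutError, ConnectionError) as exc:
        return ProbeResult(name, Health.UNKNOWN, f"{type(exc).__name__}: {exc}")
    if value is None:
        return ProbeResult(name, Health.UNKNOWN, "no reading returned")
    if value > threshold:
        return ProbeResult(name, Health.UNHEALTHY, f"{value} > {threshold}")
    return ProbeResult(name, Health.HEALTHY)


def should_alert(result: ProbeResult) -> bool:
    """UNKNOWN is alertable, exactly like UNHEALTHY."""
    return result.health is not Health.HEALTHY


def may_proceed(results: List[ProbeResult]) -> bool:
    """Fail-safe gate: proceed only on positive evidence of health.

    An empty result list counts as no evidence, so it also blocks.
    """
    return bool(results) and all(r.health is Health.HEALTHY for r in results)


def stale_probes(expected: List[str], last_heartbeat: Dict[str, float],
                 now: float, max_age: float) -> List[str]:
    """Monitor the monitors: list probes whose heartbeat is missing or too old."""
    return [
        name for name in expected
        if name not in last_heartbeat or now - last_heartbeat[name] > max_age
    ]
```

The empty-list check in `may_proceed` matters because Python's `all([])` is `True`; without it, a run in which every probe was dropped would pass the gate, which is the same fail-open behaviour in a different place. Storing `reason` on each result gives a later review a record of which diagnostics were denied and which actually reported.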
When to use
- Assessing a design in which agent health, alerting, or control decisions depend on telemetry and diagnostic tools that can themselves be blocked or denied.
- Reviewing a monitoring stack that returns defaults or empty responses when a probe cannot run.
- Investigating an incident that ran without alerts and finding the diagnostics had silently stopped reporting.