Typed Refusal Codes
Define a single source of truth for machine-readable refusal codes across all guard surfaces, so refusals can be triaged mechanically rather than by string-grepping ad-hoc human-readable messages.
Problem
Refusals are the single most important class of events to triage cleanly: they mark the boundary between policy-aligned behaviour and policy-violating behaviour. When every guard formats its own refusal string by hand, the audit story collapses. Counts of 'how many refusals last week, of what kind' depend on regexes that break whenever a guard's author rephrases the message; legacy guards that predate a category cannot be retrofitted without text-search risk; and every downstream consumer (a Slack alert, a dashboard, a fine-tuning negative-example pipeline) builds its own ad-hoc parser. A single source of truth for refusal codes is the obvious lever; teams rarely pull it because each guard feels self-contained.
Solution
Maintain a single module that exports: a ReasonCode enum (e.g. POLICY_VIOLATION, RATE_LIMIT, UNVERIFIED_TOOL, RCE_RISK, LOOP_DETECTED, INTEGRITY_FAILURE, CONTEXT_INJECTION, ...); a format_refusal(code, detail) helper returning 'REFUSED: CODE: detail'; a parse_refusal(string) helper that returns (code, detail) or None; and a KNOWN_CODES constant for consumers to validate against. Every guard surface in the system uses format_refusal exclusively. Legacy substrings ('cannot comply', 'blocked by policy', etc.) are recognised by parse_refusal as code aliases so old logs keep parsing. Unknown codes return None from the parser rather than throwing. Downstream tooling depends only on the parser, never on raw strings.
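A minimal Python sketch of such a module follows. It is an illustration under stated assumptions, not a reference implementation: the choice of Python, the LEGACY_ALIASES table and the codes its phrases map to, and the decision to return the whole original string as the detail for legacy matches are all hypothetical. The source names only ReasonCode, format_refusal, parse_refusal, KNOWN_CODES and the 'REFUSED: CODE: detail' format.

```python
from enum import Enum
from typing import Optional, Tuple


class ReasonCode(Enum):
    POLICY_VIOLATION = "POLICY_VIOLATION"
    RATE_LIMIT = "RATE_LIMIT"
    UNVERIFIED_TOOL = "UNVERIFIED_TOOL"
    RCE_RISK = "RCE_RISK"
    LOOP_DETECTED = "LOOP_DETECTED"
    INTEGRITY_FAILURE = "INTEGRITY_FAILURE"
    CONTEXT_INJECTION = "CONTEXT_INJECTION"


# Consumers validate against this set instead of hard-coding strings.
KNOWN_CODES = frozenset(code.value for code in ReasonCode)

PREFIX = "REFUSED: "

# Assumption: which code each legacy phrase maps to is invented for this sketch.
LEGACY_ALIASES = {
    "cannot comply": ReasonCode.POLICY_VIOLATION,
    "blocked by policy": ReasonCode.POLICY_VIOLATION,
}


def format_refusal(code: ReasonCode, detail: str) -> str:
    """Return the canonical 'REFUSED: CODE: detail' string."""
    return f"{PREFIX}{code.value}: {detail}"


def parse_refusal(text: str) -> Optional[Tuple[ReasonCode, str]]:
    """Return (code, detail), or None if the text is not a recognised refusal."""
    if text.startswith(PREFIX):
        name, sep, detail = text[len(PREFIX):].partition(": ")
        if sep and name in KNOWN_CODES:
            return ReasonCode(name), detail
        return None  # canonical prefix but unknown code: no exception
    # Fall back to legacy phrasings so old logs keep parsing.
    lowered = text.lower()
    for phrase, code in LEGACY_ALIASES.items():
        if phrase in lowered:
            return code, text  # assumption: whole string serves as the detail
    return None
```

Because partition splits on the first ': ' only, a detail that itself contains a colon round-trips unchanged through format_refusal and parse_refusal.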
When to use
- The stack has three or more guard surfaces that each emit refusals.
- Downstream observability depends on counting or alerting on refusal categories.
- Legacy refusal phrasings already exist and must keep parsing.
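To illustrate the observability case, the sketch below tallies refusal categories from log lines while depending only on a parser function, never on raw strings. The names count_refusals, demo_parse and the UNPARSED bucket are hypothetical; demo_parse is a reduced stand-in for the shared parse_refusal so that the example runs on its own.

```python
from collections import Counter
from typing import Callable, Iterable, Optional, Tuple

Parser = Callable[[str], Optional[Tuple[str, str]]]


def count_refusals(lines: Iterable[str], parse: Parser) -> Counter:
    """Tally refusal codes; lines the parser rejects go to 'UNPARSED'."""
    counts: Counter = Counter()
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            counts["UNPARSED"] += 1
        else:
            code, _detail = parsed
            counts[code] += 1
    return counts


def demo_parse(line: str) -> Optional[Tuple[str, str]]:
    # Stand-in for the shared parser: handles the canonical format only.
    prefix = "REFUSED: "
    if not line.startswith(prefix):
        return None
    code, sep, detail = line[len(prefix):].partition(": ")
    return (code, detail) if sep else None


logs = [
    "REFUSED: RATE_LIMIT: too many calls",
    "REFUSED: RATE_LIMIT: burst exceeded",
    "REFUSED: RCE_RISK: shell command in tool input",
    "request completed",
]
counts = count_refusals(logs, demo_parse)
```

Passing the parser in as an argument keeps the consumer decoupled from the message format: if a guard's wording changes, only the shared module needs updating.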