VIII · Safety & Control · Emerging

Typed Refusal Codes

also known as Machine-Readable Refusal Reasons, Refusal Reason Enum

Define a single source of truth for machine-readable refusal codes across all guard surfaces, so refusals can be triaged mechanically rather than by string-grepping ad-hoc human-readable messages.

Context

A mature agent stack accumulates many guard surfaces: a tool-loop guard, a skill-scanner that refuses risky imports, a post-compaction guard that rejects suspicious context restorations, an RCE backstop, an input/output guardrail. Each was added at a different time and emits its own refusal string in a different shape. Downstream observability — logs, audits, dashboards, on-call triage — has to grep through human-readable strings to count and classify refusals, and small wording changes silently break the dashboards.

Problem

Refusals are the single most important class of events to triage cleanly: they mark the boundary between policy-aligned and policy-violating behaviour. When every guard formats its own refusal string by hand, auditing becomes unreliable. Counts of 'how many refusals last week, of what kind' depend on regexes that break when one guard's author rephrases a message; legacy guards that predate a category cannot be reclassified without risky text searches; and each downstream consumer (a Slack alert, a dashboard, a pipeline that collects negative examples for fine-tuning) builds its own ad-hoc parser. A single source of truth for refusal codes is the obvious lever, yet teams rarely pull it because each guard feels self-contained.

Forces

  • Many independent guard surfaces emit refusals; centralisation is non-trivial.
  • Each refusal must carry a machine-readable code (enum-style) and a human-readable detail in one string.
  • Legacy refusal phrasings must keep working or existing dashboards break.
  • New codes appear over time; the enum must be extensible without breaking parsers.
  • Parsing must be cheap; refusal events fire on the hot path.

Example

An agent stack has five places that can emit a refusal: a tool-loop guard, a skill-scanner that refuses risky imports, a post-compaction integrity check, an RCE backstop, and a top-level input/output guardrail. Without centralisation, each emits its own string ('I cannot help with that', 'blocked by policy', 'unsupported tool', etc.), and the dashboard parses these with brittle regex. After centralisation, every surface emits 'REFUSED: POLICY_VIOLATION: vendor block on this domain' or 'REFUSED: LOOP_DETECTED: same tool called 7x in 12s'. The dashboard groups by code, the on-call channel alerts on RCE_RISK and INTEGRITY_FAILURE, and the legacy substrings still parse because they are recognised as aliases.
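The dashboard side of this example can be sketched in a few lines. This is a minimal illustration that assumes the 'REFUSED: CODE: detail' shape described above; the names refusal_code, summarise and ALERT_CODES are hypothetical, and the RCE_RISK detail text in the usage note is invented for illustration.

```python
from collections import Counter

# Codes that should page the on-call channel (taken from the example above).
ALERT_CODES = {"RCE_RISK", "INTEGRITY_FAILURE"}


def refusal_code(line):
    """Return the code field of a 'REFUSED: CODE: detail' line, else None."""
    parts = line.split(": ", 2)
    if len(parts) == 3 and parts[0] == "REFUSED":
        return parts[1]
    return None


def summarise(lines):
    """Count refusals per code and collect the lines that should alert."""
    counts = Counter()
    alerts = []
    for line in lines:
        code = refusal_code(line)
        if code is None:
            continue  # not a refusal event; ignore it
        counts[code] += 1
        if code in ALERT_CODES:
            alerts.append(line)
    return counts, alerts
```

Given the log lines 'REFUSED: POLICY_VIOLATION: vendor block on this domain', 'REFUSED: LOOP_DETECTED: same tool called 7x in 12s' and 'REFUSED: RCE_RISK: shell metacharacters in argument', summarise counts one refusal per code and returns only the RCE_RISK line as an alert. Grouping depends on the code field alone, so rewording a detail string does not change the counts.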

Solution

Therefore:

Maintain a single module that exports:

  • a ReasonCode enum (e.g. POLICY_VIOLATION, RATE_LIMIT, UNVERIFIED_TOOL, RCE_RISK, LOOP_DETECTED, INTEGRITY_FAILURE, CONTEXT_INJECTION, ...);
  • a format_refusal(code, detail) helper returning 'REFUSED: CODE: detail';
  • a parse_refusal(string) helper that returns (code, detail) or None;
  • a KNOWN_CODES constant for consumers to validate against.

Every guard surface in the system uses format_refusal exclusively. Legacy substrings ('cannot comply', 'blocked by policy', etc.) are recognised by parse_refusal as code aliases so old logs keep parsing. Unknown codes return None from the parser rather than throwing. Downstream tooling depends only on the parser, never on raw strings.
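The module described above might look roughly like the following. This is a sketch under stated assumptions, not a reference implementation: the LEGACY_ALIASES table and the code each legacy phrase maps to are hypothetical, as is the choice to return the whole legacy string as the detail.

```python
import re
from enum import Enum
from typing import Optional, Tuple


class ReasonCode(str, Enum):
    POLICY_VIOLATION = "POLICY_VIOLATION"
    RATE_LIMIT = "RATE_LIMIT"
    UNVERIFIED_TOOL = "UNVERIFIED_TOOL"
    RCE_RISK = "RCE_RISK"
    LOOP_DETECTED = "LOOP_DETECTED"
    INTEGRITY_FAILURE = "INTEGRITY_FAILURE"
    CONTEXT_INJECTION = "CONTEXT_INJECTION"


# Consumers validate against this set instead of importing the enum.
KNOWN_CODES = frozenset(code.value for code in ReasonCode)

# Assumption: which code each legacy phrase maps to is illustrative only.
LEGACY_ALIASES = {
    "cannot comply": ReasonCode.POLICY_VIOLATION,
    "blocked by policy": ReasonCode.POLICY_VIOLATION,
}

# One precompiled pattern keeps parsing cheap on the hot path.
_REFUSAL_RE = re.compile(r"^REFUSED: ([A-Z][A-Z0-9_]*): (.*)$", re.DOTALL)


def format_refusal(code: ReasonCode, detail: str) -> str:
    """Build the canonical string; raises ValueError for an unknown code."""
    return f"REFUSED: {ReasonCode(code).value}: {detail}"


def parse_refusal(text: str) -> Optional[Tuple[ReasonCode, str]]:
    """Return (code, detail), or None for unknown codes and non-refusals."""
    match = _REFUSAL_RE.match(text)
    if match:
        name, detail = match.groups()
        if name in KNOWN_CODES:
            return ReasonCode(name), detail
        return None  # well-formed but unknown code: do not raise
    lowered = text.lower()
    for phrase, code in LEGACY_ALIASES.items():
        if phrase in lowered:
            return code, text  # legacy string: keep the full text as detail
    return None
```

Returning None for an unknown code lets an older parser skip events from a newer producer instead of failing, which is one way to keep the enum extensible. Whether such events should also be logged or counted separately is a design choice this sketch leaves open.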

What this pattern forbids. No guard surface in the stack may emit a refusal string by hand; every refusal must flow through format_refusal so the code field is machine-readable and the detail string is the only free-form portion.
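One way to enforce this rule mechanically is a check that flags string literals beginning with the refusal prefix anywhere outside the central module. The sketch below assumes the guards are written in Python and uses the standard ast module; find_handwritten_refusals is a hypothetical helper, and a real check would also need to exempt the central module itself.

```python
import ast
from typing import List

REFUSAL_PREFIX = "REFUSED:"


def find_handwritten_refusals(source: str) -> List[int]:
    """Return line numbers of string literals that start with the prefix.

    Literal parts of f-strings are also ast.Constant nodes, so
    f"REFUSED: {code}: ..." is flagged as well.
    """
    tree = ast.parse(source)
    hits = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value.startswith(REFUSAL_PREFIX):
                hits.add(node.lineno)
    return sorted(hits)
```

Run in CI over every guard file, a non-empty result points at a refusal that bypassed format_refusal. The check only catches literals that start with the prefix, so strings assembled by concatenation from other pieces would slip through.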

And the patterns that stand alongside it, or against it —

  • complements: Refusal (★★). Explicitly refuse requests that fall outside the agent's scope, capability, or policy boundaries.
  • complements: Input/Output Guardrails (★★). Validate inputs before they reach the model and outputs before they reach the user.
  • complements: Policy-as-Code Gate. Evaluate every proposed agent action against externally-managed machine-readable policies before dispatch, so compliance authorship lives outside the prompt and outside the agent code.
  • complements: Decision Log (★★). Persist the agent's reasoning trace alongside its actions so post-hoc review can explain why.
  • complements: Stochastic-Deterministic Boundary (SDB). Formalize the seam between an LLM proposal and a system action as a four-part contract (proposer, verifier, commit step, reject signal) so the contract itself, not the agent's good intent, gates side-effects.
  • complements: Supervisor-Plus-Gate. Supervisor controller that validates and gates LLM outputs against deterministic checks before they commit to side-effects.
  • complements: Reflexive Metacognitive Agent (·). Agent maintains an explicit self-model of its own capabilities, confidence and limitations, and reasons over that model when accepting / refusing / handing off tasks.
