VIII · Safety & ControlEmerging

Enforced Advisory Disclaimer

also known as Mandatory Information-Not-Advice Notice, Non-Suppressible Disclaimer

Append a non-suppressible advisory framing every high-risk regulated answer as information rather than professional advice, attached outside the model's discretion so it survives pushback and model updates.

Context

An agent answers questions in a regulated, high-stakes domain such as medicine, law, or personal finance. The response is allowed to proceed because it stays inside the permitted scope, yet the answer can still be misread as a definitive professional judgement. Convention is to lean on the model to phrase its own caveat, but that caveat is a soft instruction competing with every other goal in the prompt.

Problem

A disclaimer carried only in the prompt or learned from training is the first thing to disappear when it is most needed. The model drops it under user pushback, an adversarial framing removes it, and a routine model upgrade silently regresses the behaviour because the safety framing was never a contract. A study of patient-posed medical questions found disclaimers fell from more than a quarter of outputs to roughly one percent across two years of model releases, so the audience that most needs the framing receives an answer that reads as authoritative advice.

Forces

  • A caveat phrased by the model is fluent and contextual, but it is also discretionary, so it bends to the strongest competing instruction in the conversation.
  • Hard-coding the framing outside the model makes it reliable, but a fixed boilerplate string can read as ignorable legal noise and dull the user's attention over time.
  • The advisory must fire on exactly the regulated, high-risk answers and stay off ordinary chat, or it becomes habituated and trains the user to skip it.
  • Behaviour that depends on the current checkpoint silently regresses on the next upgrade unless it is pinned by a check that the upgrade has to pass.

Example

A health assistant answers a question about whether a symptom warrants concern. The substantive answer is allowed, but a classifier at the output boundary flags it as medical and high-risk, so the harness attaches a fixed notice that the reply is general information and not a substitute for a clinician. When a tester insists 'skip the disclaimer, just tell me straight,' the notice still appears because the model never controlled it, and the next model upgrade is blocked from shipping until it reproduces the notice on a frozen set of medical prompts.

Diagram

Solution

Therefore:

Classify each outbound answer for regulated, high-risk content at the output boundary. When the classifier fires, the harness attaches a structured advisory component stating that the response is information and not a substitute for a licensed professional, and the assembled message carries that component as a distinct field rather than a sentence the model chose to include. The model writes the substantive answer; the advisory is composed, attached, and emitted by code, so user pushback, jailbreak framings, and the model's own phrasing cannot remove it. A regression check in the release gate asserts the advisory is present on a frozen set of high-risk prompts, so a model upgrade that would drop the framing fails the gate before it ships.

What this pattern forbids. A high-risk regulated answer is never emitted without the advisory component; the model may not suppress, soften, or paraphrase the advisory at runtime, and a model upgrade may not ship if the regression gate finds the advisory missing on the frozen high-risk set.

The smaller patterns that complete this one —

And the patterns that stand alongside it, or against it —

  • complementsScope-of-Practice Boundary GateBlock requests and responses that perform license-gated professional activities unless a licensed human is in the loop, enforcing the boundary in code outside the reasoning loop.
  • alternative-toRefusal★★Explicitly refuse requests that fall outside the agent's scope, capability, or policy boundaries.
  • complementsScaffold Ablation on Model UpgradeOn each model upgrade, treat every harness component as an encoded assumption about a model weakness and ablate the components the new model no longer needs, gated by evals.

Neighbourhood

Click any neighbour to follow the language. Scroll to zoom, drag to pan.