Canonical-Entity Grounding

also known as Master-Data Lookup Grounding, Authoritative Identifier Resolution, Entity-Resolve-Before-Act

Require the agent to resolve every business identifier it uses — SKU, account, supplier, customer — through an authoritative lookup against the system of record, rather than emitting the identifier from the model's parametric memory.

Context

An agent acts over enterprise systems whose entities are identified by exact codes — general-ledger accounts, stock-keeping units, supplier numbers, customer ids, project codes — that carry no meaning the model could infer and that must match a record exactly to be valid. The model is fluent enough to produce strings that look like these codes, and a code that is plausible but wrong points at the wrong account or the wrong part. The authoritative values live in master data the model was never trained on and that changes after training.

Problem

Asked for an identifier it does not have, a model will supply one from parametric memory that is well-formed and confidently wrong — a close-enough part number, an account code from a similar company, a supplier id that no longer exists. In an enterprise system there is no credit for close: a transaction posted to a plausible-but-wrong GL account is a real error, not an approximation. Because the fabricated identifier is syntactically valid, downstream format validation often accepts it, and the mistake surfaces only later as a misposting or a failed integration.

Forces

The model is good at mapping a vague reference such as 'the Berlin office supplier' to intent, but bad at producing the exact code that intent corresponds to.
Master data is authoritative and current; the model's parametric knowledge of identifiers is neither.
A fabricated identifier is often well-formed, so format validation passes and the error escapes.
Calling a lookup on every identifier costs latency and tool calls; skipping it risks silent corruption of a system of record.
Identifiers change after the model is trained, so even a once-correct memorised code drifts out of date.
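
The third force can be seen in a few lines. The following is a minimal sketch, assuming a made-up supplier-number format (`S-` plus five digits) and a tiny in-memory stand-in for master data; neither comes from a real system. It shows a fabricated id passing the format check while failing the existence check.

```python
import re

# Hypothetical supplier-number format and master-data extract, for illustration only.
SUPPLIER_ID_FORMAT = re.compile(r"S-\d{5}")
MASTER_SUPPLIER_IDS = {"S-10482", "S-20917", "S-31550"}


def is_well_formed(supplier_id: str) -> bool:
    """Format check: does the string look like a supplier number?"""
    return SUPPLIER_ID_FORMAT.fullmatch(supplier_id) is not None


def exists_in_master_data(supplier_id: str) -> bool:
    """Existence check: is there actually a record with this id?"""
    return supplier_id in MASTER_SUPPLIER_IDS


fabricated = "S-10483"  # one digit off a real id, as a model might plausibly emit

print(is_well_formed(fabricated))         # True
print(exists_in_master_data(fabricated))  # False
```

Only the second check consults the system of record, which is why format validation alone cannot catch this class of error.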

Example

An agent processes supplier invoices into an ERP. A scanned invoice reads 'Müller Bürobedarf, Munich', and the agent's first instinct is to fill in a supplier number that looks right from similar vendors. Instead it calls a master-data resolver with the name and address; the resolver returns the one canonical supplier id on file, or reports no confident match. When a new vendor has no record, the agent stops and flags it for onboarding rather than posting to a guessed account. Only resolved, real identifiers ever reach the posting step.
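
The invoice flow above might look roughly like the sketch below. Everything named here is an assumption made for illustration: the `Supplier` record, the in-memory `SUPPLIERS` list standing in for the ERP's master data, and the `resolve_supplier` and `process_invoice` helpers. A real resolver would query the system of record and would likely match more carefully than a normalised substring test.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Supplier:
    supplier_id: str
    name: str
    city: str


# Hypothetical master-data extract; a real resolver would query the ERP.
SUPPLIERS = [
    Supplier("S-10482", "Müller Bürobedarf GmbH", "Munich"),
    Supplier("S-20917", "Schmidt Papier AG", "Berlin"),
]


def normalise(text: str) -> str:
    """Case-fold and collapse whitespace so scanned text compares cleanly."""
    return " ".join(text.casefold().split())


def resolve_supplier(name: str, city: str) -> Optional[str]:
    """Return the single canonical supplier id, or None when there is no unique match."""
    wanted_name, wanted_city = normalise(name), normalise(city)
    if not wanted_name:
        return None  # an empty name must never match anything
    matches = [
        s for s in SUPPLIERS
        if wanted_name in normalise(s.name) and wanted_city == normalise(s.city)
    ]
    return matches[0].supplier_id if len(matches) == 1 else None


def process_invoice(name: str, city: str, amount: float) -> dict:
    supplier_id = resolve_supplier(name, city)
    if supplier_id is None:
        # No record on file: stop and hand over, never post against a guessed id.
        return {"status": "flagged_for_onboarding", "supplier_name": name}
    return {"status": "posted", "supplier_id": supplier_id, "amount": amount}


print(process_invoice("Müller Bürobedarf", "Munich", 120.0)["status"])  # posted
print(process_invoice("Neue Firma KG", "Hamburg", 50.0)["status"])      # flagged_for_onboarding
```

The supplier id in the posted result comes from the lookup, never from the text of the invoice or from the model.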

Diagram

sequenceDiagram
    participant A as Agent (LLM)
    participant R as Master-data resolver (tool)
    participant M as Master data / system of record
    A->>R: reference (description or candidate id)
    R->>M: lookup
    alt confident match
        M-->>R: canonical id
        R-->>A: canonical_id + confidence
        A->>M: action uses canonical_id
    else no match / low confidence
        R-->>A: no-match
        A->>A: clarify or halt (do not guess)
    end

Solution

Therefore:

Give the agent a resolver tool over master data that takes a description or a candidate identifier and returns either the canonical id with a confidence score or an explicit no-match. Require every identifier that will enter an action, especially a write, to pass through this resolver first: the model proposes intent ('post to the marketing-travel account for the Munich entity') and the resolver returns the exact code, rather than the model emitting the code directly. Treat the resolver's output, not the model's text, as the identifier of record. On a no-match or a low-confidence result the agent asks for clarification or halts rather than guessing. Once an identifier is resolved, the rest of the operation runs against that canonical id deterministically. Where volume is high, the resolver can return a fetched candidate set for the model to select from, so that the model chooses among real entities rather than inventing one.
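
One possible shape for such a resolver is sketched below, under stated assumptions. The account codes in `GL_ACCOUNTS`, the `Resolution` type, and both thresholds are invented for illustration. The score is a plain string-similarity ratio from Python's standard `difflib`, used here as a stand-in for whatever matching a real resolver performs; it is not a calibrated probability.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical chart-of-accounts extract: canonical id -> description.
GL_ACCOUNTS = {
    "6610": "marketing travel munich",
    "6620": "marketing travel berlin",
    "4400": "office supplies munich",
}

MIN_CONFIDENCE = 0.85  # assumed threshold; would need tuning against real data
MIN_MARGIN = 0.10      # best candidate must beat the runner-up by at least this much


@dataclass(frozen=True)
class Resolution:
    canonical_id: Optional[str]  # None means an explicit no-match
    confidence: float

    @property
    def matched(self) -> bool:
        return self.canonical_id is not None


def resolve_account(description: str) -> Resolution:
    """Map a free-text description to a canonical account id, or report no-match."""
    query = " ".join(description.casefold().split())
    scored = sorted(
        ((SequenceMatcher(None, query, text).ratio(), account_id)
         for account_id, text in GL_ACCOUNTS.items()),
        reverse=True,
    )
    best_score, best_id = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    # Refuse both weak matches and matches that are not clearly ahead of the next one.
    if best_score < MIN_CONFIDENCE or best_score - runner_up < MIN_MARGIN:
        return Resolution(None, best_score)
    return Resolution(best_id, best_score)


print(resolve_account("Marketing travel Munich").canonical_id)  # 6610
print(resolve_account("marketing travel").matched)              # False
```

The second call shows the ambiguous case: the description fits the Munich and Berlin accounts equally well, so the resolver reports no-match and leaves the agent to clarify rather than pick one.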

What it gives you

Writes to the system of record can only name entities that exist, removing a whole class of confident-but-wrong errors.
The boundary between fuzzy intent (the model) and exact identity (master data) is explicit and testable.
Resolution against current master data tolerates identifiers that changed after the model was trained.
A no-match becomes a visible clarification or halt instead of a silent misposting found weeks later.

What it costs you

A lookup on every identifier adds latency and tool calls, which matters in high-volume batch work.
The resolver itself can return the wrong entity when the description is ambiguous or master data is dirty.
Building and maintaining a resolver over messy master data is real integration work.
Over-eager resolution can mask a genuine data-entry problem a human should have seen.
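
The pattern as described does not prescribe a remedy for the first cost above. One option, offered only as an assumption, is to resolve each distinct reference once per batch run, so that repeated mentions of the same supplier do not trigger repeated lookups. The `resolve_batch` helper and its `resolve` callback are hypothetical names; results are kept only for the duration of the batch, so later changes to master data are still picked up on the next run.

```python
from typing import Callable, Dict, Iterable, Optional


def resolve_batch(
    references: Iterable[str],
    resolve: Callable[[str], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Resolve each distinct reference once, keyed by its normalised form.

    A value of None records an explicit no-match for that reference.
    """
    results: Dict[str, Optional[str]] = {}
    for reference in references:
        key = " ".join(reference.casefold().split())
        if key not in results:
            results[key] = resolve(reference)  # one lookup per distinct reference
    return results
```

Every identifier still comes from the resolver; the sketch reduces only the number of calls, not the requirement to make them.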

What this pattern forbids. The agent must not use a self-generated identifier in an action against the system of record: every GL account, SKU, supplier, or customer id must be confirmed by the resolver tool first, and on a no-match or low-confidence result the agent must not guess but must clarify or halt.
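
The prohibition is easier to rely on when it is enforced in code rather than by prompt alone. The sketch below is one way to do that, assuming hypothetical names throughout (`Resolver`, `ResolvedId`, `post_to_ledger`): the resolver records every id it confirms, and the posting step refuses anything the resolver did not issue, including a raw string that happens to be a valid code.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


class UnresolvedIdentifierError(Exception):
    """Raised when an action names an identifier the resolver did not issue."""


@dataclass(frozen=True)
class ResolvedId:
    value: str
    entity_type: str


class Resolver:
    """Hypothetical resolver that remembers which ids it has confirmed."""

    def __init__(self, master_data: Dict[str, Set[str]]):
        self._master_data = master_data
        self._issued: Set[ResolvedId] = set()

    def confirm(self, entity_type: str, candidate: str) -> Optional[ResolvedId]:
        """Return a ResolvedId if the candidate exists in master data, else None."""
        if candidate in self._master_data.get(entity_type, set()):
            resolved = ResolvedId(candidate, entity_type)
            self._issued.add(resolved)
            return resolved
        return None  # explicit no-match: the caller must clarify or halt

    def was_issued(self, identifier: object) -> bool:
        return isinstance(identifier, ResolvedId) and identifier in self._issued


def post_to_ledger(resolver: Resolver, account: object, amount: float) -> str:
    """Stand-in for a write to the system of record."""
    if not resolver.was_issued(account):
        raise UnresolvedIdentifierError(
            f"identifier not confirmed by resolver: {account!r}"
        )
    return f"posted {amount:.2f} to {account.value}"


resolver = Resolver({"gl_account": {"6610", "6620"}})
account = resolver.confirm("gl_account", "6610")
print(post_to_ledger(resolver, account, 250.0))  # posted 250.00 to 6610
```

Passing the bare string `"6610"` to `post_to_ledger` raises `UnresolvedIdentifierError` even though the code exists, because the check is on provenance (did the resolver confirm it for this run) rather than on the value alone.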

And the patterns that stand alongside it, or against it —

complements Agentic RAG ★★: Replace static retrieve-then-generate with autonomous agents that plan, choose sources, retrieve iteratively, reflect, and re-query.
complements CRAG ★: Add a lightweight retrieval evaluator that grades each retrieved document and triggers corrective web search on poor retrievals.
complements Chain of Verification ★: Reduce hallucination by drafting an answer, generating independent verification questions, answering them in isolation, and revising.
complements JSON-Only Action Schema ✕: Anti-pattern. Restrict the agent's action language to JSON tool-call dictionaries even for tasks where code-as-action (functions composing, loops, conditionals over results) would be the natural shape.
complements Citation Attribution ★★: Track and surface, alongside a RAG-grounded answer, which retrieved chunks supported which claims, so the binding between answer span and source survives all the way to the user.
complements Tenant-Scoped Tool Binding ★: Bind every tool call and retrieval to the active tenant in code at the execution layer, so a multi-tenant agent can never be talked into reading or writing another tenant's data.
complements Risk-Tiered Action Autonomy ★: Set an agent's permitted action class by the financial materiality of the action, letting it read and draft freely while requiring a different human principal to release material postings, payments, or filings.
complements Affordance Grounding Before Action ·: Have a vision-language model ground each candidate action against the current scene and predict its affordance, so that actions the environment cannot physically support are discarded before any reach the controller.
complements Verify-Before-Cite Resolution Gate ★: After generation, resolve every cited authority against an external ground-truth registry and strip or block any citation that does not exist before the answer reaches the reader.
complements Table-Augmented Generation ·: Answer a natural-language question over a database in three stages: synthesise an executable query, run it against the data layer with model calls embedded in execution, then generate the answer from the result.
complements Semantic-Layer Query Guardrail ★: Route natural-language data questions through a curated semantic layer so the model selects and parameterises vetted metrics and dimensions instead of free-authoring raw SQL against production data.