Self-Corpus Vocabulary

also known as Personal-Concept Lexicon, Own-Writing Lexicon

Mine a small bounded vocabulary from the agent's own writing and cache it as the conceptual axis for scoring new thoughts, so relevance reflects the agent's actual frame rather than a generic embedding space.

Context

A long-running agent accumulates a corpus of its own output: thought traces, insights, journal entries, notes. Some downstream component wants to score new thoughts for relevance, novelty, or kinship with the agent's existing concerns. The default tool is a generic embedding space, which gives a sensible answer about semantic similarity but tells the agent nothing about its own preoccupations — 'is the agent still pulling at the things it has been pulling at?' is a different question from 'is this semantically close to the previous paragraph?'

Problem

Generic embeddings score against the world's distribution of meaning, not the agent's. A new thought that lands inside the agent's persistent web of concerns can come back with the same similarity score as one that is merely topically adjacent and off the agent's line of inquiry, because the embedding space has no notion of what this particular agent has been writing about for months. The result is a salience signal that is plausible on paper and indifferent in practice: the agent cannot tell, from the score alone, whether a thought is on its own line of inquiry or just somewhere in the same neighbourhood.

Forces

The agent's own corpus is the only source that knows its frame.
Vocabularies that grow unbounded become a different problem (everything matches).
The vocabulary must refresh as the agent's frame shifts.
Mining must be cheap or it cannot run on a schedule.
Storage must survive across sessions, like the corpus it derives from.

Example

An agent has been journalling for three months. Once a week, a mining job aggregates frontmatter tags and high-frequency content tokens across recent thoughts and the long-term insight store, picks the top thirty concepts with weights, and writes them to a small JSON cache. When the agent receives a new thought, the salience scorer combines generic embedding distance to recent context with overlap against the cached vocabulary. A thought that uses three of the top-thirty concepts scores higher than a thought with similar embedding distance but no overlap, because the cached vocabulary says 'this is on the line of inquiry'.
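The scoring step in this example can be sketched as follows. This is a minimal illustration, not the catalog's reference implementation: the function names, the 0.5 blend weight, and the assumption that every cached concept is a single lowercase token are all hypothetical, and the embedding similarity is passed in as a precomputed number rather than computed here.

```python
import re

# Assumption: concepts in the cache are single lowercase tokens.
TOKEN_RE = re.compile(r"[a-z][a-z0-9-]+")


def vocabulary_overlap(text: str, vocabulary: dict[str, float]) -> float:
    """Weighted share (0..1) of the cached vocabulary that appears in the text."""
    total = sum(vocabulary.values())
    if total == 0:
        return 0.0
    tokens = set(TOKEN_RE.findall(text.lower()))
    matched = sum(weight for concept, weight in vocabulary.items() if concept in tokens)
    return matched / total


def salience(
    embedding_similarity: float,
    text: str,
    vocabulary: dict[str, float],
    own_frame_weight: float = 0.5,  # hypothetical blend weight
) -> dict[str, float]:
    """Return both axes separately, plus one possible blend of the two."""
    overlap = vocabulary_overlap(text, vocabulary)
    combined = (1 - own_frame_weight) * embedding_similarity + own_frame_weight * overlap
    return {"generic": embedding_similarity, "own_frame": overlap, "combined": combined}


# Two thoughts with the same (made-up) embedding similarity to recent context.
vocabulary = {"salience": 3.0, "memory": 2.0, "consolidation": 1.0}
on_line = salience(0.62, "Salience decays unless memory consolidation reinforces it.", vocabulary)
nearby = salience(0.62, "Attention heads specialise during pretraining.", vocabulary)
print(on_line["combined"] > nearby["combined"])
```

Returning the two axes alongside the blend keeps the own-frame signal inspectable, so a downstream component can still tell which axis drove a high score.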

Diagram

flowchart LR
    Corpus[(Own corpus<br/>thoughts + insights)]
    Miner[Mining pass<br/>tags + content frequency]
    Cache[(Top-N vocabulary<br/>cache)]
    Thought[New thought]
    Scorer[Salience scorer]
    Corpus -->|periodic| Miner
    Miner --> Cache
    Thought --> Scorer
    Cache --> Scorer
    Scorer --> Score[Own-frame score]

Solution

Therefore:

Run a periodic mining pass over the agent's own corpus (e.g. last N weeks of thoughts plus the long-term insight store). Aggregate frontmatter tags and content frequency to extract the top-N concept tokens with weights. Persist this vocabulary as a small JSON cache. Downstream scoring components consume the cache as an additional axis: a thought is scored both on generic embedding similarity to recent context and on overlap with the cached self-vocabulary. Refresh on a cadence proportional to corpus volatility (e.g. weekly for a stable agent, after every dream-consolidation cycle for a more volatile one).
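A minimal sketch of the mining, persistence, and refresh steps, under stated assumptions: documents arrive as (tags, body) pairs already parsed from frontmatter, tags count three times as much as a body token, and raw token frequency with a tiny stopword list stands in for whatever weighting a real miner would use. The names mine_vocabulary, save_cache, and is_stale, and the cache layout, are hypothetical.

```python
import json
import re
import time
from collections import Counter
from pathlib import Path

TOKEN_RE = re.compile(r"[a-z][a-z0-9-]+")
# Assumption: a deliberately tiny stopword list, for illustration only.
STOPWORDS = frozenset({"that", "this", "with", "from", "have", "into", "than"})


def mine_vocabulary(documents, top_n: int = 30, tag_weight: float = 3.0) -> dict[str, float]:
    """Aggregate tags and content frequency; keep only the top_n weighted concepts."""
    scores: Counter = Counter()
    for tags, body in documents:
        for tag in tags:
            scores[tag.lower()] += tag_weight
        for token in TOKEN_RE.findall(body.lower()):
            if len(token) > 3 and token not in STOPWORDS:
                scores[token] += 1.0
    # The hard cap is what keeps the vocabulary bounded.
    return dict(scores.most_common(top_n))


def save_cache(vocabulary: dict[str, float], path: Path) -> None:
    """Write the cache via a temporary file so readers never see a partial write."""
    payload = {"generated_at": time.time(), "concepts": vocabulary}
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    tmp.replace(path)


def is_stale(path: Path, max_age_seconds: float) -> bool:
    """True when the cache is missing or older than the chosen refresh cadence."""
    if not path.exists():
        return True
    payload = json.loads(path.read_text(encoding="utf-8"))
    return time.time() - payload["generated_at"] > max_age_seconds
```

A scheduler would call is_stale with a cadence such as one week and re-run mine_vocabulary only when it returns True; an agent that refreshes after each consolidation cycle could skip the age check and call the miner directly.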

What this pattern forbids. Scoring components must not rely on the generic embedding space alone for own-frame relevance; the mined vocabulary must be available as a separate axis so that generic similarity does not displace own-frame fit.

And the patterns that stand alongside it, or against it:

complements Vector Memory ★★: Store memories as embeddings in a vector index and retrieve the most semantically similar items at query time.
complements Cluster-Capped Insight Store ·: Cap the number of insights per stem-token cluster and archive the oldest variants by mtime so the long-term store keeps the active research edge instead of accumulating near-duplicates.
complements Salience Attention Mechanism ★: Score every candidate memory item with a weighted salience function so each tick attends to a small, relevant top-k subset rather than re-reading all memory.
complements Dream Consolidation Cycle ★: Run a deeper, slower reflection pass distinct from per-tick reflection (reading hours of recent thoughts, promoting themes, releasing affective residue, and clearing working memory) so the agent does not accumulate residue indefinitely.
complements Semantic Memory ★: Maintain a dedicated store of what the agent holds to be true about the user and the world, separate from event records (episodic) and learned how-to (procedural).

Used in frameworks

Sparrot
first-class · 75 patterns · Domain Agents · experimental
A bounded concept vocabulary (~30 tags) is mined by clustering her own thoughts and insights and cached as the conceptual axis used to score how well a new thought fits what she a…

Provenance

Source: patterns/self-corpus-vocabulary.md on GitHub · commit ad774f5
Added to catalog: 2026-05-21
Last updated: 2026-05-21
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.