Contextual Retrieval

also known as Chunk Contextualisation, Anthropic Contextual Embeddings

Prepend a short LLM-generated description to each chunk before embedding so the chunk carries its situating context.

This pattern helps complete certain larger patterns —

specialises Naive RAG ★★: Condition the generator on top-k chunks retrieved from an external dense index so knowledge lives outside parameters.
used-by Citation Attribution ★★: Track and surface, alongside a RAG-grounded answer, which retrieved chunks supported which claims, so the binding between answer span and source survives all the way to the user.

Context

A team is using a retrieval-augmented system over a corpus that has been split into small chunks for embedding and indexing. Many of those chunks lose surrounding context at the split boundary: pronouns like 'they' or 'it' no longer have an antecedent in the chunk, references like 'the company' or 'that quarter' drop their referent, and time references become ambiguous. The embeddings of these decontextualised chunks land far from queries that name the entity or time period explicitly.

Problem

When a user query names an entity by its full name and the corpus chunk that contains the answer refers to that entity only by pronoun, the chunk's embedding sits far from the query's and vector search misses it. A naive chunk-and-embed pipeline therefore destroys exactly the context it most needs to preserve, and recall on otherwise-easy queries collapses. The chunks need to carry enough surrounding context that their embeddings stay close to the queries that should retrieve them, without inflating the corpus so much that indexing and retrieval costs become unaffordable.

Forces

An LLM call per chunk is expensive.
Prompt caching of the parent document amortises the cost.
Context generation must be deterministic enough to keep the index stable.
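
One way to reconcile the cost and stability forces is to generate each context once and reuse it whenever the same chunk is re-indexed. The following is a minimal sketch under stated assumptions, not part of the pattern itself: `ContextStore`, `context_key`, and the `generate` callable are hypothetical names, the in-memory dict stands in for whatever persistent store a pipeline would use, and this local reuse is separate from provider-side prompt caching.

```python
import hashlib
from typing import Callable, Dict


def context_key(document: str, chunk: str, model: str) -> str:
    """Stable key for one (model, document, chunk) triple."""
    h = hashlib.sha256()
    for part in (model, document, chunk):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    return h.hexdigest()


class ContextStore:
    """Hypothetical cache: each chunk's context is generated at most once."""

    def __init__(self, generate: Callable[[str, str], str]) -> None:
        self._generate = generate          # stand-in for the LLM call
        self._cache: Dict[str, str] = {}   # assumption: in-memory only
        self.llm_calls = 0

    def get(self, document: str, chunk: str, model: str = "example-model") -> str:
        key = context_key(document, chunk, model)
        if key not in self._cache:
            self.llm_calls += 1
            self._cache[key] = self._generate(document, chunk)
        return self._cache[key]
```

With a store like this, re-running the indexing job over an unchanged document reuses the stored contexts, so the prepended text and therefore the embeddings stay the same. A changed document, chunk, or model name produces a new key and a fresh LLM call.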

Example

A 200-page company handbook is split into 600 chunks for retrieval. One chunk says 'the deadline is the 15th of the following month', but a query for 'invoice deadline' is unlikely to retrieve it because the chunk never says 'invoice'. Contextual Retrieval prepends a one-sentence context to each chunk: 'This chunk discusses invoice payment timing.' Now the embedding carries the context the original chunk lost when it was split.
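
The effect can be illustrated with a toy lexical score. This is only a sketch: counting shared tokens is a crude stand-in for BM25 or embedding similarity, and the helper names `tokens` and `overlap` are made up for the illustration.

```python
import re


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(query: str, text: str) -> int:
    """Number of distinct query tokens that also appear in the text."""
    return len(tokens(query) & tokens(text))


query = "invoice deadline"
raw_chunk = "the deadline is the 15th of the following month"
contextualised_chunk = "This chunk discusses invoice payment timing. " + raw_chunk

raw_score = overlap(query, raw_chunk)             # shares only "deadline"
ctx_score = overlap(query, contextualised_chunk)  # shares "invoice" and "deadline"
```

The raw chunk shares one query term and the contextualised chunk shares both, which is the kind of gap that decides whether the chunk makes it into the top-k.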

Diagram

flowchart TD
    Doc[Parent document] --> P[LLM: situate chunk]
    Chunk[Chunk] --> P
    P --> Pre[Short description]
    Pre --> J[Prepend to chunk]
    Chunk --> J
    J --> E[Embed]
    J --> BM[Index BM25]

Solution

Therefore:

For each chunk, prompt an LLM with the parent document and the chunk, and receive a short description that situates the chunk within the document. Prepend that description to the chunk. Index the prepended chunk twice: as a dense vector (Contextual Embeddings) and in a BM25 index (Contextual BM25). Compose with reranking for further gains.
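
The steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not a reference implementation: `generate_context` is a hypothetical stand-in for the LLM call (stubbed here with a canned sentence), the prompt wording is illustrative, and embedding and BM25 indexing are left to whatever libraries the pipeline already uses.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative prompt wording (an assumption, not a quoted template).
# The document comes first so the unchanging prefix can be cached across chunks.
PROMPT_TEMPLATE = (
    "<document>\n{document}\n</document>\n"
    "Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n"
    "Write one short sentence situating this chunk within the document."
)


@dataclass
class ContextualChunk:
    context: str   # LLM-generated situating description
    original: str  # raw chunk text, kept for display and citation

    @property
    def indexed_text(self) -> str:
        # The text handed to both the embedder and the BM25 indexer.
        return f"{self.context}\n\n{self.original}"


def build_prompt(document: str, chunk: str) -> str:
    return PROMPT_TEMPLATE.format(document=document, chunk=chunk)


def contextualise(
    document: str,
    chunks: List[str],
    generate_context: Callable[[str], str],  # hypothetical LLM wrapper
) -> List[ContextualChunk]:
    """Pair every chunk with a generated context; no raw chunk is returned alone."""
    result = []
    for chunk in chunks:
        context = generate_context(build_prompt(document, chunk)).strip()
        result.append(ContextualChunk(context=context, original=chunk))
    return result


def stub_llm(prompt: str) -> str:
    # Stand-in for a real model call, used only to make the sketch runnable.
    return "This chunk discusses invoice payment timing."


if __name__ == "__main__":
    handbook = "(full handbook text would go here)"
    pieces = ["the deadline is the 15th of the following month"]
    for item in contextualise(handbook, pieces, stub_llm):
        print(item.indexed_text)
```

Keeping `original` separate from `indexed_text` is a design choice of this sketch: the index sees the contextualised text, while downstream display or citation can still show the chunk as it appeared in the source.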

What this pattern forbids. Chunks enter the index only after contextualisation; raw chunks are not indexed.

The smaller patterns that complete this one —

uses Prompt Caching ★★: Order prompts so the unchanging prefix can be cached by the provider, cutting per-call cost and latency.

And the patterns that stand alongside it, or against it —

composes-with Hybrid Search ★★: Combine sparse lexical retrieval (BM25) with dense vector retrieval and fuse the results.
composes-with Cross-Encoder Reranking ★★: After cheap bi-encoder or BM25 retrieval, rescore top-N candidates with a cross-encoder that jointly attends over (query, candidate).
alternative-to RAFT ★: Train the model to ignore irrelevant retrieved documents (distractors) in a domain-specific RAG setting.
alternative-to Memory Poisoning ✕: Anti-pattern in which agent long-term memory (vector store, knowledge graph, episodic log) is written from any surface the agent reads, with no provenance check.
composes-with Hierarchical Retrieval ★★: Route a query through a multi-level cascade (coarse source or index selection, then per-source narrower retrieval, then chunk-level) so each retrieval decision is pushed to the cheapest tier that can answer it.
complements Information Chunking for Agent Memory ★★: Structure inputs into digestible topical segments (chunks) before feeding to short-term memory rather than throwing the full input at the model; reduces overload and increases accuracy (~40% improvement observed in customer-service deployment).

Used in recipes

Production RAG
core

References

Introducing Contextual Retrieval (blog)

Provenance

Source: patterns/contextual-retrieval.md on GitHub · commit 4fa1213
Added to catalog: 2026-04-30
Last updated: 2026-05-21
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.