Book IV

Retrieval & RAG

Knowledge from outside parameters.

17 patterns in this book.

When to reach for each

01. Agentic RAG
Replace static retrieve-then-generate with autonomous agents that plan, choose sources, retrieve iteratively, reflect, and re-query.
Best for: tasks whose information needs a single retrieve-then-generate pass cannot meet.
Tradeoff: cost and latency rise with each loop iteration.
Watch for: cases where static one-shot RAG already meets quality targets at lower cost and latency.

02. Hybrid Search
Combine sparse lexical retrieval (BM25) with dense vector retrieval and fuse the results.
Best for: queries that mix semantic intent with rare tokens (codes, IDs, proper nouns) that embeddings miss.
Tradeoff: two indexes to keep in sync.
Watch for: a uniformly conceptual corpus, where dense retrieval alone is enough.

03. Naive RAG
Condition the generator on top-k chunks retrieved from an external dense index so knowledge lives outside parameters.
Best for: knowledge that lives outside the model and must be supplied at query time.
Tradeoff: chunk boundaries can cut a passage off from the context it depends on.
Watch for: needed knowledge that is already in a tool, database, or scoped system prompt (see naive-rag-first).

04. Cross-Encoder Reranking
After cheap bi-encoder or BM25 retrieval, rescore top-N candidates with a cross-encoder that jointly attends over (query, candidate).
Best for: pipelines where initial retrieval returns a noisy top 100 and the accuracy of the top 5 matters.
Tradeoff: each candidate adds one cross-encoder scoring call, so latency grows with N.
Watch for: end-to-end latency targets under 100 ms, which cross-encoder scoring is likely to exceed.

05. GraphRAG
Build an LLM-extracted entity-and-relation knowledge graph plus hierarchical community summaries, then answer global queries via map-reduce over those summaries.
Best for: global, corpus-wide questions that local chunk retrieval cannot answer.
Tradeoff: high indexing cost (orders of magnitude more LLM calls).
Watch for: narrowly local queries that naive RAG already serves well.

All patterns in this book

Agentic RAG

×49

Replace static retrieve-then-generate with autonomous agents that plan, choose sources, retrieve iteratively, reflect, and re-query.

Hybrid Search

×9

Combine sparse lexical retrieval (BM25) with dense vector retrieval and fuse the results.
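
The fusion step can be sketched with reciprocal rank fusion (RRF), one common choice; the pattern text does not say which fusion method is meant, so treat RRF, the constant k=60, and the document ids below as assumptions for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score; break ties by doc id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical outputs of a BM25 index and a dense index for the same query.
bm25_hits = ["doc_7", "doc_2", "doc_9"]
dense_hits = ["doc_2", "doc_5", "doc_7"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

RRF uses only ranks, so the BM25 and dense scores never need to be put on a common scale.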

Naive RAG

×7

Condition the generator on top-k chunks retrieved from an external dense index so knowledge lives outside parameters.
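
A minimal sketch of the retrieve-then-condition flow, assuming a toy bag-of-words `embed` in place of a real dense embedding model. The function names, prompt wording, and sample chunks are illustrative, not taken from the pattern.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks, k=2):
    """Place the retrieved chunks in the prompt that the generator will see."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = [
    "The refund window is 30 days from delivery.",
    "Shipping takes 3 to 5 business days.",
    "Refund requests need an order number.",
]
prompt = build_prompt("how long is the refund window", chunks)
print(prompt)
```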

Cross-Encoder Reranking

×5

After cheap bi-encoder or BM25 retrieval, rescore top-N candidates with a cross-encoder that jointly attends over (query, candidate).
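
The two-stage shape can be sketched as follows. `toy_score_pair` is a hypothetical stand-in for a cross-encoder forward pass, used only so the example runs without a model; a real scorer would read the query and candidate together.

```python
def rerank(query, candidates, score_pair, top_n=3):
    """Rescore each (query, candidate) pair and keep the top_n best."""
    scored = [(score_pair(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

def toy_score_pair(query, candidate):
    # Placeholder scorer: fraction of query terms that occur in the candidate.
    q_terms = set(query.lower().split())
    c_terms = set(candidate.lower().split())
    return len(q_terms & c_terms) / len(q_terms)

# Hypothetical first-stage results, in the order the cheap retriever returned them.
first_stage = [
    "reset your router by holding the button",
    "password reset link expires after one hour",
    "how to reset a forgotten password",
]
top = rerank("reset forgotten password", first_stage, toy_score_pair, top_n=2)
print(top)
```

The scorer runs once per candidate, which is where the latency cost comes from.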

GraphRAG

×5

Build an LLM-extracted entity-and-relation knowledge graph plus hierarchical community summaries, then answer global queries via map-reduce over those summaries.

Contextual Retrieval

×4

Prepend a short LLM-generated description to each chunk before embedding so the chunk carries its situating context.
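
As a rough sketch, the prepend step looks like this. `toy_describe` is a hypothetical placeholder for the LLM call that writes the situating description, and the record layout and sample text are assumptions.

```python
def contextualize(doc_title, chunks, describe):
    """Build the text to embed for each chunk: description first, then the chunk."""
    records = []
    for chunk in chunks:
        context = describe(doc_title, chunk)
        records.append({
            "original": chunk,                        # returned to the generator
            "text_to_embed": f"{context}\n\n{chunk}"  # sent to the embedding model
        })
    return records

def toy_describe(doc_title, chunk):
    # Placeholder for an LLM-generated description of where the chunk sits.
    return f"This chunk is from '{doc_title}'."

records = contextualize(
    "ACME Q2 2023 report",
    ["Revenue grew 3% over the previous quarter."],
    toy_describe,
)
print(records[0]["text_to_embed"])
```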

Query Rewriting

×3

Use an LLM to generate several alternative formulations of the user's query, retrieve documents for each, and rank-fuse the results so recall does not depend on one phrasing.

HippoRAG

×3

Build an LLM-extracted schemaless knowledge graph from the corpus and run Personalized PageRank seeded on the query's key concepts so multi-hop retrieval completes in a single pass.
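
Personalized PageRank can be sketched with plain power iteration. The tiny entity graph, the seed choice, and the damping value are assumptions for illustration; mapping node scores back to passages is left out.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iters=100):
    """Power iteration where the random walk restarts only at the seed nodes.

    graph maps each node to a list of neighbours; every neighbour and every
    seed must itself be a key of graph.
    """
    seeds = sorted(set(seeds))
    restart = {n: 0.0 for n in graph}
    for s in seeds:
        restart[s] = 1.0 / len(seeds)
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1.0 - damping) * restart[n] for n in graph}
        for node, neighbours in graph.items():
            if neighbours:
                share = damping * rank[node] / len(neighbours)
                for m in neighbours:
                    nxt[m] += share
            else:
                # A node with no edges returns its mass to the seeds.
                for s in seeds:
                    nxt[s] += damping * rank[node] * restart[s]
        rank = nxt
    return rank

# Hypothetical graph: one connected chain of entities plus an unrelated pair.
graph = {
    "alice": ["acme"],
    "acme": ["alice", "berlin"],
    "berlin": ["acme", "germany"],
    "germany": ["berlin"],
    "mars": ["rover"],
    "rover": ["mars"],
}
ranks = personalized_pagerank(graph, seeds=["alice"])
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 3))
```

Nodes several hops from the seed still receive score in the same run, while the disconnected pair receives none.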

Citation Attribution

×2

Track and surface, alongside a RAG-grounded answer, which retrieved chunks supported which claims, so the binding between answer span and source survives all the way to the user.

CRAG

×2

Add a lightweight retrieval evaluator that grades each retrieved document and triggers corrective web search on poor retrievals.
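
A simplified sketch of the grade-then-correct control flow. The threshold, `toy_grade`, and `toy_web_search` are hypothetical placeholders for a trained evaluator and a real search call, and the single fallback branch is a simplification.

```python
def corrective_retrieve(query, retrieved, grade, web_search, threshold=0.5):
    """Keep documents the evaluator accepts; fall back to web search if none pass."""
    kept = [doc for doc in retrieved if grade(query, doc) >= threshold]
    if kept:
        return kept, "index"
    return web_search(query), "web"

def toy_grade(query, doc):
    # Placeholder evaluator: share of query terms that appear in the document.
    terms = set(query.lower().split())
    return len(terms & set(doc.lower().split())) / len(terms)

def toy_web_search(query):
    # Placeholder for a real search call; returns a canned result.
    return [f"web result for: {query}"]

docs = ["python 3.12 release notes", "gardening tips for spring"]
good, source_a = corrective_retrieve("python release notes", docs, toy_grade, toy_web_search)
bad, source_b = corrective_retrieve("quarterly tax deadline", docs, toy_grade, toy_web_search)
print(source_a, good)
print(source_b, bad)
```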

CDC-Driven Vector Sync

×1

Treat the source-of-truth document store as the only writer; keep the vector index in sync by emitting change-data-capture events onto a queue that the feature pipeline consumes.
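
The consumer side can be sketched with an in-memory queue and a dict standing in for the vector index. The event schema (`op`, `id`, `text`) and `toy_embed` are assumptions, not the format of any particular CDC tool.

```python
from collections import deque

def toy_embed(text):
    # Placeholder for a real embedding model: (character count, word count).
    return (len(text), len(text.split()))

def consume(events, index, embed=toy_embed):
    """Drain the queue in order, applying each change event to the vector index."""
    while events:
        event = events.popleft()
        if event["op"] == "delete":
            index.pop(event["id"], None)
        else:
            # Inserts and updates are both upserts keyed by document id.
            index[event["id"]] = embed(event["text"])
    return index

# Hypothetical change events emitted by the document store, oldest first.
queue = deque([
    {"op": "insert", "id": "a", "text": "first draft"},
    {"op": "insert", "id": "b", "text": "to be removed"},
    {"op": "update", "id": "a", "text": "second draft, longer"},
    {"op": "delete", "id": "b"},
])
vector_index = consume(queue, {})
print(vector_index)
```

Only the consumer writes to the index, so replaying the events in order reproduces the state of the document store.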

HyDE

×1

Have the LLM write a hypothetical answer document, embed it, and use it as the retrieval query.

Modular RAG

×1

Decompose RAG into a typed three-layer hierarchy of Module Types, Modules, and Operators so the pipeline (routing, scheduling, fusion, retrieval, post-retrieval, generation) can be rearranged per que…

RAFT

×1

Train the model to ignore irrelevant retrieved documents (distractors) in a domain-specific RAG setting.

Self-RAG

×1

Fine-tune the model to emit reflection tokens that decide when to retrieve, evaluate retrieved relevance, and assess generated support.

Streaming Feature Pipeline

×1

Process raw documents into RAG features as a continuous stream rather than a batch job, with typed models pinning each stage.

Hierarchical Retrieval

Route a query through a multi-level cascade — coarse source or index selection, then per-source narrower retrieval, then chunk-level — so each retrieval decision is pushed to the cheapest tier that c…