Retrieval & RAG
Knowledge from outside the model's parameters.
17 patterns in this book.
When to reach for each
01. Agentic RAG Replace static retrieve-then-generate with autonomous agents that plan, choose sources, retrieve iteratively, reflect, and re-query. Best for: A single retrieve-then-generate pass is insufficient for the task's information needs. Tradeoff: Cost and latency rise with loop iterations. Watch for: Static one-shot RAG already meets quality targets at lower cost and latency.
02. Hybrid Search Combine sparse lexical retrieval (BM25) with dense vector retrieval and fuse the results. Best for: Queries mix semantic intent with rare tokens (codes, IDs, proper nouns) that embeddings miss. Tradeoff: Two indexes to keep in sync. Watch for: The corpus is uniformly conceptual; dense alone is enough.
03. Naive RAG Condition the generator on top-k chunks retrieved from an external dense index so knowledge lives outside parameters. Best for: Knowledge lives outside the model and must be conditioned on at query time. Tradeoff: Chunk boundaries destroy context. Watch for: The needed knowledge is already in a tool, database, or scoped system prompt (see naive-rag-first).
04. Cross-Encoder Reranking After cheap bi-encoder or BM25 retrieval, rescore top-N candidates with a cross-encoder that jointly attends over (query, candidate). Best for: Initial retrieval returns a noisy top-100 and accuracy of top-5 matters. Tradeoff: Each candidate needs its own cross-encoder scoring pass, so latency grows with N. Watch for: Latency target is sub-100 ms end-to-end; cross-encoders blow it.
05. GraphRAG Build an LLM-extracted entity-and-relation knowledge graph plus hierarchical community summaries, then answer global queries via map-reduce over those summaries. Best for: Users ask global, corpus-wide questions that local chunk retrieval cannot answer. Tradeoff: High indexing cost (orders of magnitude more LLM calls). Watch for: Queries are narrowly local and naive RAG already serves them well.
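The fusion step in Hybrid Search can be sketched with reciprocal rank fusion (RRF). This is one common fusion choice, not necessarily the one the book prescribes; the function name, the document IDs, and the two hit lists are hypothetical, and `k=60` is the constant conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked well by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results: BM25 catches the rare SKU token, dense catches intent.
bm25_hits = ["doc_sku_4471", "doc_returns", "doc_shipping"]
dense_hits = ["doc_returns", "doc_refunds", "doc_sku_4471"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales. The same function also fits the Query Rewriting pattern, where one ranked list per reformulated query is fused.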
All patterns in this book
Agentic RAG
×49 Replace static retrieve-then-generate with autonomous agents that plan, choose sources, retrieve iteratively, reflect, and re-query.
Hybrid Search
×9 Combine sparse lexical retrieval (BM25) with dense vector retrieval and fuse the results.
Naive RAG
×7 Condition the generator on top-k chunks retrieved from an external dense index so knowledge lives outside parameters.
Cross-Encoder Reranking
×5 After cheap bi-encoder or BM25 retrieval, rescore top-N candidates with a cross-encoder that jointly attends over (query, candidate).
GraphRAG
×5 Build an LLM-extracted entity-and-relation knowledge graph plus hierarchical community summaries, then answer global queries via map-reduce over those summaries.
Contextual Retrieval
×4 Prepend a short LLM-generated description to each chunk before embedding so the chunk carries its situating context.
Query Rewriting
×3 Use an LLM to generate several alternative formulations of the user's query, retrieve documents for each, and rank-fuse the results so recall does not depend on one phrasing.
HippoRAG
×3 Build an LLM-extracted schemaless knowledge graph from the corpus and run Personalized PageRank seeded on the query's key concepts so multi-hop retrieval completes in a single pass.
Citation Attribution
×2 Track and surface, alongside a RAG-grounded answer, which retrieved chunks supported which claims, so the binding between answer span and source survives all the way to the user.
CRAG
×2 Add a lightweight retrieval evaluator that grades each retrieved document and triggers corrective web search on poor retrievals.
CDC-Driven Vector Sync
×1 Treat the source-of-truth document store as the only writer; keep the vector index in sync by emitting change-data-capture events onto a queue that the feature pipeline consumes.
HyDE
×1 Have the LLM write a hypothetical answer document, embed it, and use it as the retrieval query.
Modular RAG
×1 Decompose RAG into a typed three-layer hierarchy of Module Types, Modules, and Operators so the pipeline (routing, scheduling, fusion, retrieval, post-retrieval, generation) can be rearranged per que…
RAFT
×1 Train the model to ignore irrelevant retrieved documents (distractors) in a domain-specific RAG setting.
Self-RAG
×1 Fine-tune the model to emit reflection tokens that decide when to retrieve, evaluate retrieved relevance, and assess generated support.
Streaming Feature Pipeline
×1 Process raw documents into RAG features as a continuous stream rather than a batch job, with typed models pinning each stage.
Hierarchical Retrieval
Route a query through a multi-level cascade — coarse source or index selection, then per-source narrower retrieval, then chunk-level — so each retrieval decision is pushed to the cheapest tier that c…
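The cascade described for Hierarchical Retrieval (source selection, then per-source narrowing, then chunk-level retrieval) might look roughly like the sketch below. Everything here is an assumption for illustration: the token-overlap scorer stands in for whatever retriever each tier would really use, and the `sources` layout, field names, and function names are hypothetical.

```python
def overlap(query: str, text: str) -> int:
    """Toy relevance score: count of shared lowercase tokens (a stand-in scorer)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def top(items, score, n):
    """Return the n highest-scoring items; ties keep their original order."""
    return sorted(items, key=score, reverse=True)[:n]


def hierarchical_retrieve(query, sources, n_sources=1, n_docs=1, n_chunks=2):
    results = []
    # Tier 1: pick sources using only their short descriptions.
    for source in top(sources, lambda s: overlap(query, s["description"]), n_sources):
        # Tier 2: narrow to documents inside the chosen source, by title.
        for doc in top(source["docs"], lambda d: overlap(query, d["title"]), n_docs):
            # Tier 3: score individual chunks only within the surviving documents.
            for chunk in top(doc["chunks"], lambda c: overlap(query, c), n_chunks):
                results.append((source["name"], doc["title"], chunk))
    return results


# Hypothetical two-source corpus.
sources = [
    {"name": "billing", "description": "invoices refund policy payments", "docs": [
        {"title": "refund policy", "chunks": [
            "refund requests are accepted within 30 days",
            "contact support to start a refund",
        ]},
        {"title": "invoice format", "chunks": ["invoices list tax per line item"]},
    ]},
    {"name": "engineering", "description": "api deployment runbooks", "docs": [
        {"title": "deployment runbook", "chunks": ["deploy with the release script"]},
    ]},
]

hits = hierarchical_retrieve("what is the refund policy", sources)
for hit in hits:
    print(hit)
```

In this sketch the engineering source is discarded at the first tier, so its chunks are never scored; that pruning is where a cascade of this shape saves work.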