IV · Retrieval & RAG · Mature ★★

Naive RAG

also known as Retrieval-Augmented Generation, Top-K Retrieve-and-Stuff

Condition the generator on top-k chunks retrieved from an external dense index so knowledge lives outside parameters.

This pattern helps complete certain larger patterns —

  • specialises: Agentic RAG ★★ · Replace static retrieve-then-generate with autonomous agents that plan, choose sources, retrieve iteratively, reflect, and re-query.
  • used-by: App Exploration Phase · Before deploying an agent against an opaque app, have it explore (or watch a human demonstrate) the app, generating a per-element documentation knowledge base; at deployment, retrieve element docs to ground actions.
  • used-by: Augmented LLM ★★ · Build the foundational agent block as an LLM augmented with retrieval, tools, and memory that the model actively chooses to use, rather than a bare-model call.
  • used-by: Citation Attribution ★★ · Track and surface, alongside a RAG-grounded answer, which retrieved chunks supported which claims, so the binding between answer span and source survives all the way to the user.
  • specialises: Modular RAG · Decompose RAG into a typed three-layer hierarchy of Module Types, Modules, and Operators so the pipeline (routing, scheduling, fusion, retrieval, post-retrieval, generation) can be rearranged per query rather than running a fixed linear retrieve-then-generate.

Context

A team needs a model to answer questions whose answers depend on information that lives in a corpus too large to fit into the prompt — internal documentation, a knowledge base, a product catalogue, recent news, a body of research papers. The corpus also changes regularly, faster than retraining the base model would allow, so any answers based on the model's training data alone will go stale or be missing entirely.

Problem

A bare language model has no access to information beyond what is baked into its weights, and any attempt to answer from parametric memory alone tends to hallucinate plausible-sounding answers, cannot cite a source, and cannot be updated without retraining. The team needs the model to pull relevant external knowledge in at query time, but doing so requires deciding how to chunk the corpus, how to index it, what to retrieve per query, and how to feed it into the prompt. Without that retrieval machinery, the model is stuck with what it already knew at training time.

Forces

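  • Chunk size trades context loss against retrieval recall: small chunks are easier to match to a specific query but lose their surrounding context, while large chunks keep context but are harder to match precisely.
  • Embedding choice constrains retrieval quality: the generator only ever sees what the encoder ranks as similar to the query.
  • Single-shot retrieval misses multi-hop questions, where the evidence needed for a later hop only becomes apparent after reading the first.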
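
The chunk-size force can be made concrete with a minimal sketch. The word-window chunker below is an assumption for illustration (the pattern does not prescribe a chunking strategy), and `chunk_words`, `size`, and `overlap` are hypothetical names. Overlap is one common way to soften context loss at chunk boundaries.

```python
def chunk_words(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of `size` words; neighbours share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


doc = "one two three four five six seven eight nine ten"
small = chunk_words(doc, size=3, overlap=1)  # many narrow chunks
large = chunk_words(doc, size=6, overlap=2)  # few wide chunks
```

With the same ten-word input, the small setting yields five narrow chunks and the large setting yields two wide ones. Real pipelines usually chunk by tokens, sentences, or document structure rather than raw words.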

Example

A startup ships a support assistant whose knowledge changes weekly: release notes, pricing, integration guides. Baking all of it into the prompt does not scale, and fine-tuning on every release is impractical. They adopt Naive RAG: chunk the docs, embed them with a dense encoder, index the embeddings, and at query time retrieve the top-k chunks and prepend them to the prompt. The pipeline is the simplest possible and ships in a week. Knowledge updates now flow by re-indexing the docs, not by retraining or redeploying the model.
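
One way the "update by re-indexing" step could look is sketched below. Everything here is an assumption for illustration: `DocIndex`, `sync`, and the placeholder `fake_embed` are hypothetical names, whole documents are embedded rather than chunks to keep the sketch short, and a real system would call an actual dense encoder and a vector store.

```python
import hashlib
from typing import Callable

Vector = list[float]


class DocIndex:
    """Keeps one embedding per document and re-embeds only changed documents."""

    def __init__(self, embed: Callable[[str], Vector]):
        self._embed = embed
        # doc_id -> (content hash, embedding)
        self.entries: dict[str, tuple[str, Vector]] = {}

    def sync(self, docs: dict[str, str]) -> list[str]:
        """Align the index with `docs`; return the ids that were (re-)embedded."""
        changed = []
        for doc_id, text in docs.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            entry = self.entries.get(doc_id)
            if entry is None or entry[0] != digest:
                self.entries[doc_id] = (digest, self._embed(text))
                changed.append(doc_id)
        for doc_id in list(self.entries):
            if doc_id not in docs:
                del self.entries[doc_id]  # document was removed from the corpus
        return changed


def fake_embed(text: str) -> Vector:
    # Placeholder only: NOT a real encoder, just enough to exercise the index.
    return [float(len(text))]


index = DocIndex(fake_embed)
first = index.sync({"pricing": "Pro is $49", "notes": "v2.3 released"})
second = index.sync({"pricing": "Pro is $59", "notes": "v2.3 released"})
```

In this sketch the first sync embeds both documents and the second re-embeds only `pricing`, so a weekly docs release touches the index and nothing else.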

Solution

Therefore:

Chunk the corpus. Embed each chunk with a dense encoder. At query time, embed the query, retrieve the top-k chunks by similarity, prepend them to the prompt, and generate. This is the simplest production RAG pipeline.
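
The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `embed` is a toy hashed bag-of-words vector standing in for a real dense encoder, the index is a plain in-memory list, and the names `DIM`, `build_index`, `retrieve`, and `build_prompt` are hypothetical. The final generation call is left out because it depends on the model API in use.

```python
import hashlib
import math
import re

DIM = 256  # hypothetical vector width for the toy encoder


def embed(text: str) -> list[float]:
    """Toy stand-in for a dense encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(chunks: list[str]) -> list[tuple[str, list[float]]]:
    """Offline step: embed every chunk once and keep (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, list[float]]], query: str, k: int = 2) -> list[str]:
    """Query-time step: rank chunks by cosine similarity and keep the top k."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Prepend the retrieved chunks to the question."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


chunks = [
    "The Pro plan costs 49 dollars per month.",
    "Release 2.3 adds a Slack integration.",
    "Refunds are processed within five business days.",
]
index = build_index(chunks)
query = "How much does the Pro plan cost per month?"
top = retrieve(index, query, k=2)
prompt = build_prompt(query, top)  # this string would be sent to the generator
```

Because the vectors are normalised, the dot product in `retrieve` is cosine similarity. Swapping in a real encoder and an approximate nearest-neighbour index changes `embed` and `build_index` but leaves the shape of the pipeline the same.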

What this pattern forbids. The generator may use only retrieved chunks plus its parametric memory; the retrieval set is the boundary.

The smaller patterns that complete this one —

  • generalises: HyDE · Have the LLM write a hypothetical answer document, embed it, and use it as the retrieval query.
  • generalises: Contextual Retrieval · Prepend a short LLM-generated description to each chunk before embedding so the chunk carries its situating context.
  • generalises: Vector Memory ★★ · Store memories as embeddings in a vector index and retrieve the most semantically similar items at query time.
  • generalises: RAFT · Train the model to ignore irrelevant retrieved documents (distractors) in a domain-specific RAG setting.
  • generalises: Hybrid Search ★★ · Combine sparse lexical retrieval (BM25) with dense vector retrieval and fuse the results.
  • generalises: Query Rewriting ★★ · Use an LLM to generate several alternative formulations of the user's query, retrieve documents for each, and rank-fuse the results so recall does not depend on one phrasing.
  • generalises: HippoRAG · Build an LLM-extracted schemaless knowledge graph from the corpus and run Personalized PageRank seeded on the query's key concepts so multi-hop retrieval completes in a single pass.
  • generalises: Hierarchical Retrieval ★★ · Route a query through a multi-level cascade (coarse source or index selection, then per-source narrower retrieval, then chunk-level) so each retrieval decision is pushed to the cheapest tier that can answer it.

And the patterns that stand alongside it, or against it —

  • composes-with: Cross-Encoder Reranking ★★ · After cheap bi-encoder or BM25 retrieval, rescore top-N candidates with a cross-encoder that jointly attends over (query, candidate).
  • alternative-to: GraphRAG · Build an LLM-extracted entity-and-relation knowledge graph plus hierarchical community summaries, then answer global queries via map-reduce over those summaries.
  • conflicts-with: Naive-RAG-First · Anti-pattern: reach for naive RAG before checking whether the knowledge actually needs retrieval.
  • composes-with: Chain of Verification · Reduce hallucination by drafting an answer, generating independent verification questions, answering them in isolation, and revising.
  • complements: Citation Streaming ★★ · Stream citations alongside generated text so the UI can render source links in place as content appears.
  • alternative-to: Hallucinated Citations · Anti-pattern: let the model emit citations as free text and trust them.
  • complements: Over-Search and Under-Search · Anti-pattern: let an agentic RAG system miscalibrate when to retrieve, so it either re-retrieves information already in context or skips retrieval when its parametric knowledge is stale.
  • complements: Streaming Feature Pipeline · Process raw documents into RAG features as a continuous stream rather than a batch job, with typed models pinning each stage.
  • complements: FTI LLM Pipeline Split ★★ · Decompose an LLM/RAG system into three independently-deployable pipelines (feature, training, inference) communicating only via a feature store and a model registry.
