Naive RAG

also known as Retrieval-Augmented Generation, Top-K Retrieve-and-Stuff

Condition the generator on the top-k chunks retrieved from an external dense index, so that knowledge lives outside the model's parameters.

This pattern helps complete certain larger patterns —

specialises: Agentic RAG ★★: Replace static retrieve-then-generate with autonomous agents that plan, choose sources, retrieve iteratively, reflect, and re-query.
used-by: App Exploration Phase ·: Before deploying an agent against an opaque app, have it explore (or watch a human demonstrate) the app, generating a per-element documentation knowledge base; at deployment, retrieve element docs to ground actions.
used-by: Augmented LLM ★★: Build the foundational agent block as an LLM augmented with retrieval, tools, and memory that the model actively chooses to use, rather than a bare-model call.
used-by: Citation Attribution ★★: Track and surface, alongside a RAG-grounded answer, which retrieved chunks supported which claims, so the binding between answer span and source survives all the way to the user.
specialises: Modular RAG ★: Decompose RAG into a typed three-layer hierarchy of Module Types, Modules, and Operators so the pipeline (routing, scheduling, fusion, retrieval, post-retrieval, generation) can be rearranged per query rather than running a fixed linear retrieve-then-generate.
specialises: Table-Augmented Generation ·: Answer a natural-language question over a database in three stages: synthesise an executable query, run it against the data layer with model calls embedded in execution, then generate the answer from the result.

Context

A team needs a model to answer questions whose answers depend on information that lives in a corpus too large to fit into the prompt — internal documentation, a knowledge base, a product catalogue, recent news, a body of research papers. The corpus also changes regularly, faster than retraining the base model would allow, so any answers based on the model's training data alone will go stale or be missing entirely.

Problem

A bare language model has no access to information beyond what is baked into its weights. Answering from parametric memory alone, it tends to hallucinate plausible-sounding answers, cannot cite a source, and cannot be updated without retraining. The team needs the model to pull relevant external knowledge in at query time, but doing so requires deciding how to chunk the corpus, how to index it, what to retrieve per query, and how to feed the retrieved text into the prompt. Without that retrieval machinery, the model is stuck with what it knew at training time.

Forces

Chunk size trades context loss for retrieval recall.
Embedding choice constrains retrieval quality.
Single-shot retrieval misses multi-hop questions.
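
The first force can be made concrete with a small sketch. It assumes a fixed-size word-window chunker; the function name `chunk_words` and the `overlap` parameter are illustrative choices, not something the pattern prescribes. Small windows give sharply focused chunks but can separate a fact from the words that give it meaning.

```python
def chunk_words(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of `size` words; neighbours share `overlap` words.

    Hypothetical helper for illustration only. Real pipelines often chunk on
    sentences, headings, or tokens instead of whitespace-separated words.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


doc = "The Pro plan costs 49 dollars per month and includes priority support."

small = chunk_words(doc, size=4)   # price and plan name land in different chunks
large = chunk_words(doc, size=12)  # one chunk keeps the whole sentence together

for chunk in small:
    print(repr(chunk))
print(len(large))
```

With `size=4` the price ends up in a chunk that no longer names the plan, which is the context loss the force describes; with `size=12` the sentence stays whole, but in a real corpus a chunk that large would also carry unrelated text into the embedding.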

Example

A startup ships a support assistant whose knowledge changes weekly: release notes, pricing, integration guides. Baking everything into the prompt does not scale, and fine-tuning on every release is impractical. They adopt Naive RAG: chunk the docs, embed them with a dense encoder, index the vectors, and at query time retrieve the top-k chunks and prepend them to the prompt. The pipeline is as simple as it gets and ships in a week. Knowledge updates now flow by re-indexing the docs, not by retraining or redeploying the model.

Solution

Therefore:

Chunk the corpus. Embed each chunk with a dense encoder and store the vectors in an index. At query time, embed the query, retrieve the top-k chunks by similarity, prepend them to the prompt, and generate. This is the simplest production RAG pipeline.
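
The steps above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: `toy_embed` is a bag-of-words count vector standing in for a real dense encoder, the index is an in-memory list instead of a vector store, chunking is a split on blank lines, and the generator call is left out. All function names here are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable

Vector = Counter  # toy sparse vector; a real dense encoder returns a list of floats


def toy_embed(text: str) -> Vector:
    """ASSUMPTION: placeholder for a dense encoder (lowercase bag-of-words counts)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two count vectors; 0.0 if either is empty."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def build_index(corpus: str, embed: Callable[[str], Vector] = toy_embed) -> list[tuple[str, Vector]]:
    """Chunk on blank lines, then store each chunk next to its embedding."""
    chunks = [c.strip() for c in re.split(r"\n\s*\n", corpus) if c.strip()]
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Vector]], query: str, k: int = 2,
             embed: Callable[[str], Vector] = toy_embed) -> list[str]:
    """Embed the query and return the k chunks with the highest similarity."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Prepend the retrieved chunks to the question."""
    context = "\n\n".join(chunks)
    return f"Answer using the context below.\n\n{context}\n\nQuestion: {query}\nAnswer:"


corpus = """The Pro plan costs 49 dollars per month.

Release 2.1 adds SSO login and audit logs.

The Slack integration posts alerts to a channel."""

index = build_index(corpus)
question = "How much does the Pro plan cost?"
top_chunks = retrieve(index, question, k=2)
prompt = build_prompt(question, top_chunks)
print(prompt)  # this string would be sent to the generator (model call not shown)
```

Because `embed` is passed in as a parameter, swapping the toy vectors for a real encoder changes one argument and leaves the rest of the pipeline alone. Updating knowledge means calling `build_index` again on the new corpus, with no change to the model.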

What this pattern forbids. The generator may use only retrieved chunks plus its parametric memory; the retrieval set is the boundary.

The smaller patterns that complete this one —

generalises: HyDE ★: Have the LLM write a hypothetical answer document, embed it, and use it as the retrieval query.
generalises: Contextual Retrieval ★: Prepend a short LLM-generated description to each chunk before embedding so the chunk carries its situating context.
generalises: Vector Memory ★★: Store memories as embeddings in a vector index and retrieve the most semantically similar items at query time.
generalises: RAFT ★: Train the model to ignore irrelevant retrieved documents (distractors) in a domain-specific RAG setting.
generalises: Hybrid Search ★★: Combine sparse lexical retrieval (BM25) with dense vector retrieval and fuse the results.
generalises: Query Rewriting ★★: Use an LLM to generate several alternative formulations of the user's query, retrieve documents for each, and rank-fuse the results so recall does not depend on one phrasing.
generalises: HippoRAG ★: Build an LLM-extracted schemaless knowledge graph from the corpus and run Personalized PageRank seeded on the query's key concepts so multi-hop retrieval completes in a single pass.
generalises: Hierarchical Retrieval ★★: Route a query through a multi-level cascade (coarse source or index selection, then per-source narrower retrieval, then chunk-level) so each retrieval decision is pushed to the cheapest tier that can answer it.

And the patterns that stand alongside it, or against it —

composes-with: Cross-Encoder Reranking ★★: After cheap bi-encoder or BM25 retrieval, rescore top-N candidates with a cross-encoder that jointly attends over (query, candidate).
alternative-to: GraphRAG ★: Build an LLM-extracted entity-and-relation knowledge graph plus hierarchical community summaries, then answer global queries via map-reduce over those summaries.
conflicts-with: Naive-RAG-First ✕: Anti-pattern: reach for naive RAG before checking whether the knowledge actually needs retrieval.
composes-with: Chain of Verification ★: Reduce hallucination by drafting an answer, generating independent verification questions, answering them in isolation, and revising.
complements: Citation Streaming ★★: Stream citations alongside generated text so the UI can render source links in place as content appears.
alternative-to: Hallucinated Citations ✕: Anti-pattern: let the model emit citations as free text and trust them.
complements: Over-Search and Under-Search ✕: Anti-pattern: let an agentic RAG system miscalibrate when to retrieve, so it either re-retrieves information already in context or skips retrieval when its parametric knowledge is stale.
complements: Streaming Feature Pipeline ★: Process raw documents into RAG features as a continuous stream rather than a batch job, with typed models pinning each stage.
complements: FTI LLM Pipeline Split ★★: Decompose an LLM/RAG system into three independently-deployable pipelines (feature, training, inference) communicating only via a feature store and a model registry.
alternative-to: Vectorless Reasoning-Based Retrieval ·: Retrieve by having the model reason its way down a document's own table-of-contents tree to the relevant sections, instead of embedding chunks and ranking them by vector similarity.
complements: Semantic Response Cache ★: Embed each query and, when its nearest cached neighbour is within a similarity threshold, return the stored answer instead of re-running the model so near-duplicate questions are answered cheaply.
