Hierarchical Retrieval

also known as Cascade Retrieval, Multi-Level Retrieval, Router-Then-Retrieve, Tree Retrieval

Route a query through a multi-level cascade (coarse source or index selection, then narrower per-source retrieval, then chunk-level retrieval) so that each retrieval decision is pushed to the cheapest tier that can answer it.

This pattern helps complete certain larger patterns —

specialises Naive RAG ★★: Condition the generator on top-k chunks retrieved from an external dense index so knowledge lives outside parameters.

Context

A team runs retrieval over a heterogeneous knowledge base: several distinct corpora (product docs, support tickets, internal wikis, code, web), each with its own index and its own access cost. A single flat index across the union is either prohibitively expensive to maintain or loses too much fidelity, and querying every index in parallel on every request wastes calls on sources that cannot answer the question. Within each source, documents are themselves structured — chapters contain sections contain paragraphs — and the right granularity for retrieval varies per query.

Problem

Flat retrieval over a single union index pays the cost of searching everything for every question, even when most sources are irrelevant. Fanning out to every retriever in parallel is worse still: the response waits on the slowest source, costs multiply with the number of sources, and the downstream reranker has to filter noise from sources the query never needed. At the same time, retrieving at one fixed granularity (always paragraphs, or always full documents) mismatches much of the query mix; some questions want a corpus-level answer and some want a single span. The team needs a way to spend retrieval budget in proportion to what each query actually requires.

Forces

Each retrieval tier has its own cost, latency, and recall profile; querying all of them is wasteful.
Routing decisions made by an LLM are expensive; routing decisions made by a classifier are cheap but less flexible.
Granularity should follow the query — coarse for overview questions, fine for span-level lookup.
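
The trade-off between these forces can be made concrete with a toy cost model. The sketch below is illustrative only: the tier names, cost units, and latencies are hypothetical numbers chosen for the example, not measurements, and it assumes the fan-out runs fully in parallel while a router runs before the tier it selects.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Tier:
    name: str
    cost: float        # hypothetical cost units per call
    latency_ms: float  # hypothetical latency per call


def fan_out(tiers: List[Tier]) -> Tuple[float, float]:
    """Query every tier in parallel: costs add up, latency is the slowest tier."""
    return sum(t.cost for t in tiers), max(t.latency_ms for t in tiers)


def routed(router: Tier, chosen: Tier) -> Tuple[float, float]:
    """Run the router first, then only the chosen tier: cost and latency both add."""
    return router.cost + chosen.cost, router.latency_ms + chosen.latency_ms


# Hypothetical tiers; every number here is made up for illustration.
TIERS = [
    Tier("api_docs", 1.0, 40.0),
    Tier("runbooks", 1.0, 50.0),
    Tier("tickets", 2.0, 120.0),
    Tier("web", 5.0, 800.0),
]
CLASSIFIER = Tier("classifier_router", 0.25, 5.0)
LLM_ROUTER = Tier("llm_router", 3.0, 600.0)

print("fan-out:           ", fan_out(TIERS))
print("classifier + docs: ", routed(CLASSIFIER, TIERS[0]))
print("LLM router + docs: ", routed(LLM_ROUTER, TIERS[0]))
```

Under these assumed numbers the classifier route is far cheaper than the fan-out, while the LLM router recovers only part of the saving; with different numbers the ordering could change, which is why the router choice is a force rather than a fixed rule.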

Example

A developer-support agent answers questions over four sources: API reference, runbooks, support-ticket history, and the public web. Flat retrieval hits all four on every query and the reranker has to fight ticket noise on API questions. The team switches to hierarchical retrieval: a classifier routes 'how do I authenticate?' to the API reference, 'why did deploy fail last Tuesday?' to support tickets, and 'is this issue known?' first to runbooks and then to the web only if runbooks return nothing. Inside the API reference, a tree retriever descends from section summaries to the specific code-block chunk. Three of the four retrievers are skipped on most queries; latency drops; reranker noise drops with it.
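
The routing in this example, including the runbooks-then-web fallback, might be sketched as follows. This is a minimal illustration, not the team's implementation: the keyword classifier stands in for a trained model, and the route names, `ROUTES` table, and stub retrievers are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Retriever = Callable[[str], List[str]]

# Hypothetical routing table: each intent maps to an ordered fallback chain.
ROUTES: Dict[str, List[str]] = {
    "api": ["api_reference"],
    "incident": ["tickets"],
    "known_issue": ["runbooks", "web"],
}


def classify(query: str) -> str:
    """Toy keyword classifier standing in for a trained routing model."""
    q = query.lower()
    if "known" in q or "issue" in q:
        return "known_issue"
    if "fail" in q or "deploy" in q:
        return "incident"
    return "api"


def retrieve(query: str, retrievers: Dict[str, Retriever]) -> Tuple[List[str], List[str]]:
    """Try each source in the route's chain; stop at the first non-empty result.

    Returns the hits plus the list of sources that were actually called.
    """
    tried: List[str] = []
    for source in ROUTES[classify(query)]:
        tried.append(source)
        hits = retrievers[source](query)
        if hits:
            return hits, tried
    return [], tried


# Stub retrievers for illustration; the runbook stub deliberately finds nothing.
STUBS: Dict[str, Retriever] = {
    "api_reference": lambda q: ["auth section chunk"],
    "tickets": lambda q: ["deploy failure ticket"],
    "runbooks": lambda q: [],
    "web": lambda q: ["public issue tracker page"],
}

for question in ("how do I authenticate?", "is this issue known?"):
    print(question, "->", retrieve(question, STUBS))
```

Returning the list of sources tried makes the skipped retrievers visible, which is useful when checking that the router really is avoiding calls rather than only reordering them.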

Diagram

flowchart TD
    Q[Query] --> R0{Top-level router}
    R0 -- API docs --> A[API retriever]
    R0 -- runbooks --> B[Runbook retriever]
    R0 -- tickets --> C[Ticket retriever]
    R0 -- web --> D[Web retriever]
    A --> S[Section index]
    S --> Ch[Chunk index]
    B --> Ch2[Chunk index]
    C --> Ch3[Chunk index]
    D --> Ch4[Chunk index]
    Ch --> Agg[Aggregator + reranker]
    Ch2 --> Agg
    Ch3 --> Agg
    Ch4 --> Agg
    Agg --> Out[Top-K to generator]

Solution

Therefore:

Index the corpus hierarchically: a parser builds parent-child relationships (document → section → chunk, or topic-cluster → document → chunk) and stores every level. At query time, a top-level router picks the source or sub-index that matches the query (by classifier, by embedding similarity to source summaries, or by an LLM call). The selected source then runs its own retriever, which may itself be a further router or a coarse-to-fine descent (retrieve summaries, then retrieve the children of the top-ranked summaries). The chunk-level retriever returns the final candidates. Compose with cross-encoder reranking on the final candidate set, and with hybrid search inside each leaf retriever.
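
A coarse-to-fine descent over a parent-child index could look roughly like the sketch below. It makes several simplifying assumptions: the tree has uniform depth, token overlap stands in for embedding similarity, and the `Node` type, the `descend` function, and the sample texts are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    text: str  # summary text for inner nodes, chunk text for leaves
    children: List["Node"] = field(default_factory=list)


def score(query: str, text: str) -> int:
    """Toy relevance: number of shared lowercase tokens (stand-in for embeddings)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def descend(query: str, nodes: List[Node], beam: int = 1) -> List[Node]:
    """Keep the top `beam` nodes at each level, then rank only their children."""
    while True:
        ranked = sorted(nodes, key=lambda n: score(query, n.text), reverse=True)
        top = ranked[:beam]
        children = [c for n in top for c in n.children]
        if not children:  # reached the chunk level
            return top
        nodes = children


# Hypothetical two-level index: section summaries over chunks.
INDEX = [
    Node("authentication with api keys and tokens", [
        Node("authenticate requests with an api key header"),
        Node("rotate tokens on a schedule"),
    ]),
    Node("rate limits and quotas", [
        Node("requests over the quota return an error"),
        Node("authenticate with api key on the quota endpoint"),
    ]),
]

for hit in descend("authenticate with api key", INDEX, beam=1):
    print(hit.text)
```

With `beam=1` the chunk under "rate limits and quotas" is never scored, even though it matches the query as well as the returned chunk does. Widening the beam trades extra scoring work for a lower chance of pruning a relevant sub-tree.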

What this pattern forbids. Retrieval at any tier sees only the candidates the upstream router selected; sources or sub-trees the router skipped are unreachable for this query.

The smaller patterns that complete this one —

generalises Agentic RAG ★★: Replace static retrieve-then-generate with autonomous agents that plan, choose sources, retrieve iteratively, reflect, and re-query.
uses Routing ★★: Classify an incoming request and dispatch it to the specialist (lane / agent / model) best suited to handle it.
uses Topic-Based Routing ★: Route inter-agent messages through named topics that agents subscribe to, instead of having senders address each other by id.

And the patterns that stand alongside it, or against it —

composes-with Cross-Encoder Reranking ★★: After cheap bi-encoder or BM25 retrieval, rescore top-N candidates with a cross-encoder that jointly attends over (query, candidate).
composes-with Hybrid Search ★★: Combine sparse lexical retrieval (BM25) with dense vector retrieval and fuse the results.
alternative-to GraphRAG ★: Build an LLM-extracted entity-and-relation knowledge graph plus hierarchical community summaries, then answer global queries via map-reduce over those summaries.
composes-with Modular RAG ★: Decompose RAG into a typed three-layer hierarchy of Module Types, Modules, and Operators so the pipeline (routing, scheduling, fusion, retrieval, post-retrieval, generation) can be rearranged per query rather than running a fixed linear retrieve-then-generate.
composes-with Query Rewriting ★★: Use an LLM to generate several alternative formulations of the user's query, retrieve documents for each, and rank-fuse the results so recall does not depend on one phrasing.
composes-with Contextual Retrieval ★: Prepend a short LLM-generated description to each chunk before embedding so the chunk carries its situating context.
alternative-to HippoRAG ★: Build an LLM-extracted schemaless knowledge graph from the corpus and run Personalized PageRank seeded on the query's key concepts so multi-hop retrieval completes in a single pass.
alternative-to Multi-Model Routing ★★: Send each request to the cheapest model that can handle it well.
alternative-to Vectorless Reasoning-Based Retrieval ·: Retrieve by having the model reason its way down a document's own table-of-contents tree to the relevant sections, instead of embedding chunks and ranking them by vector similarity.
alternative-to Repo Map ★: Give the agent a compact, ranked map of the codebase's symbols and their dependencies so it orients on what matters before reading any files.