Modular RAG

also known as LEGO RAG, Reconfigurable RAG, 模块化RAG, Module-Type / Module / Operator RAG

Decompose RAG into a typed three-layer hierarchy of Module Types, Modules, and Operators so that the pipeline (routing, scheduling, fusion, retrieval, post-retrieval, generation) can be rearranged per query instead of running a fixed, linear retrieve-then-generate sequence.

Context

A team has shipped a basic RAG pipeline and the workload has fragmented. Some queries need query rewriting plus reranking; others need a knowledge-graph hop; others want a direct semantic lookup without rerank; some need a routing decision between two corpora. Hard-coding one linear pipeline for the worst-case query wastes latency and cost on the cheap ones, and shipping a second pipeline duplicates everything.

Problem

A fixed Naive RAG pipeline is too rigid for heterogeneous workloads: every query flows through the same retrieve-rerank-generate stages regardless of its shape, so every request pays the worst-case cost. Forking the pipeline per query type duplicates code, splits operational metrics across pipelines, and loses the ability to share modules. Because there is no contract between stages, swapping a reranker, adding a query rewriter, or routing between corpora means editing the pipeline orchestration directly.

Forces

Heterogeneous query mix wants different pipelines, but operating many forked pipelines is expensive.
Sharing modules across pipelines requires a typed contract between stages.
Per-query routing and fusion add latency of their own, which must be repaid by better recall or by cost saved elsewhere in the pipeline.
Reconfigurability invites combinatorial explosion of pipeline shapes that are hard to evaluate.

Example

A documentation assistant serves three query shapes: lookup queries ('what is the default timeout?'), exploratory queries ('how does the deployment pipeline handle failures?'), and multi-hop entity queries ('which team owns the service that calls X?'). Under a fixed pipeline, all three pay for query rewriting, dense and sparse retrieval, reranking, and reasoning. Under Modular RAG, the Orchestration Module routes lookup queries through Retrieval(dense)→Generation only; exploratory queries through Pre-Retrieval(query-rewriting)→Retrieval(hybrid)→Post-Retrieval(rerank)→Generation; and multi-hop queries through Retrieval(graph)→Generation. Modules are shared; only the assembled pipeline differs.
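The three routes in this example can be written down as a plain routing table. The sketch below is illustrative only: `QueryShape`, `ROUTES`, and `plan_for` are hypothetical names, and it assumes some upstream classifier (not shown, and not specified by this pattern) has already labelled the query with its shape.

```python
from enum import Enum


class QueryShape(Enum):
    """The three query shapes from the documentation-assistant example."""
    LOOKUP = "lookup"
    EXPLORATORY = "exploratory"
    MULTI_HOP = "multi-hop"


# One ordered list of Module names per shape, copied from the example above.
# The string labels are placeholders for real Module objects.
ROUTES = {
    QueryShape.LOOKUP: ["Retrieval(dense)", "Generation"],
    QueryShape.EXPLORATORY: [
        "Pre-Retrieval(query-rewriting)",
        "Retrieval(hybrid)",
        "Post-Retrieval(rerank)",
        "Generation",
    ],
    QueryShape.MULTI_HOP: ["Retrieval(graph)", "Generation"],
}


def plan_for(shape: QueryShape) -> list[str]:
    """Return a copy of the route so callers cannot mutate the shared table."""
    return list(ROUTES[shape])
```

For instance, `" -> ".join(plan_for(QueryShape.MULTI_HOP))` yields `Retrieval(graph) -> Generation`: a lookup or multi-hop query runs two stages where an exploratory query runs four.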

Diagram

flowchart TD
    Q[Query] --> O[Orchestration Module]
    O --> PR[Pre-Retrieval Module<br/>query-rewriting / HyDE / decomp]
    PR --> R[Retrieval Module<br/>dense / sparse / hybrid / graph]
    R --> PO[Post-Retrieval Module<br/>rerank / compress / filter]
    PO --> G[Generation Module]
    I[Indexing Module] -.builds.-> R
    G --> Out[Answer]

Solution

Therefore:

Define six Module Types covering the RAG lifecycle (Indexing, Pre-Retrieval, Retrieval, Post-Retrieval, Generation, Orchestration). Within each, name concrete Modules (e.g. under Pre-Retrieval: Query Rewriting, HyDE, Decomposition). Implement each Module from typed Operators (atomic, swappable steps). At request time, an Orchestration Module assembles a pipeline by picking a Module for each stage the query needs and skipping the stages it does not, possibly with branching, conditional routing, and fusion. Modules expose a typed input/output contract so that any compatible Module can swap in, and new Modules ship without touching orchestration.
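One way to picture the typed contract is a single state type that every Module consumes and returns. The sketch below rests on assumptions: `RagState`, `Module`, `KeywordRetrieval`, `TopOneFilter`, `TemplateGeneration`, and `run_pipeline` are hypothetical names, and the word-overlap scorer and string-template generator are toy stand-ins for a real retriever and LLM call.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Protocol, Sequence


@dataclass(frozen=True)
class RagState:
    """The one typed value passed between Modules (assumed contract)."""
    query: str
    passages: tuple[str, ...] = ()
    answer: str = ""


class Module(Protocol):
    module_type: str  # one of the six Module Types

    def run(self, state: RagState) -> RagState: ...


def _overlap(query: str, passage: str) -> int:
    """Toy Operator: number of lowercase words shared by query and passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


@dataclass
class KeywordRetrieval:
    corpus: Sequence[str]
    k: int = 2
    module_type: str = "Retrieval"

    def run(self, state: RagState) -> RagState:
        ranked = sorted(self.corpus, key=lambda p: _overlap(state.query, p), reverse=True)
        return replace(state, passages=tuple(ranked[: self.k]))


@dataclass
class TopOneFilter:
    module_type: str = "Post-Retrieval"

    def run(self, state: RagState) -> RagState:
        # Keep only the highest-ranked passage.
        return replace(state, passages=state.passages[:1])


@dataclass
class TemplateGeneration:
    module_type: str = "Generation"

    def run(self, state: RagState) -> RagState:
        # Stand-in for an LLM call: echo the query and the retrieved context.
        context = " | ".join(state.passages)
        return replace(state, answer=f"Q: {state.query} / context: {context}")


def run_pipeline(modules: Sequence[Module], query: str) -> RagState:
    """Fold the state through whichever Modules the orchestrator selected."""
    state = RagState(query=query)
    for module in modules:
        state = module.run(state)
    return state
```

Because every Module has the same `RagState -> RagState` shape, a pipeline is just a list: `[KeywordRetrieval(corpus), TemplateGeneration()]` and `[KeywordRetrieval(corpus), TopOneFilter(), TemplateGeneration()]` both run through `run_pipeline` unchanged. A real contract would likely carry richer fields (scores, metadata, traces); a single state type is only one possible design.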

What this pattern forbids. Pipelines may only be composed from named Modules implementing typed Operator contracts; bespoke retrieval logic outside the Module inventory is forbidden, so all pipeline shapes are inspectable and replaceable.
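The restriction can be enforced mechanically by resolving every pipeline plan against an explicit inventory. This is a minimal sketch under assumptions: `ModuleInventory`, `UnknownModuleError`, and the `(module_type, name)` keying are hypothetical choices, not something the pattern prescribes.

```python
from typing import Callable

# The six Module Types named in the Solution.
MODULE_TYPES = frozenset(
    {"Indexing", "Pre-Retrieval", "Retrieval", "Post-Retrieval", "Generation", "Orchestration"}
)


class UnknownModuleError(LookupError):
    """Raised when a plan references a Module that is not in the inventory."""


class ModuleInventory:
    def __init__(self) -> None:
        self._modules: dict[tuple[str, str], Callable] = {}

    def register(self, module_type: str, name: str, fn: Callable) -> None:
        # Reject anything that does not belong to a known Module Type.
        if module_type not in MODULE_TYPES:
            raise ValueError(f"unknown Module Type: {module_type!r}")
        self._modules[(module_type, name)] = fn

    def resolve(self, plan: list[tuple[str, str]]) -> list[Callable]:
        # A plan is valid only if every step names a registered Module.
        missing = [step for step in plan if step not in self._modules]
        if missing:
            raise UnknownModuleError(f"not in inventory: {missing}")
        return [self._modules[step] for step in plan]

    def names(self) -> list[tuple[str, str]]:
        """List the inventory so that pipeline shapes stay inspectable."""
        return sorted(self._modules)
```

With this in place, a plan such as `[("Retrieval", "dense"), ("Generation", "default")]` resolves only if both entries were registered, and a bespoke step that bypasses the inventory fails at assembly time rather than in production.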

The smaller patterns that complete this one:

generalises Naive RAG ★★: Condition the generator on top-k chunks retrieved from an external dense index so knowledge lives outside parameters.

And the patterns that stand alongside it, or against it:

alternative-to Agentic RAG ★★: Replace static retrieve-then-generate with autonomous agents that plan, choose sources, retrieve iteratively, reflect, and re-query.
composes-with Hybrid Search ★★: Combine sparse lexical retrieval (BM25) with dense vector retrieval and fuse the results.
composes-with Cross-Encoder Reranking ★★: After cheap bi-encoder or BM25 retrieval, rescore top-N candidates with a cross-encoder that jointly attends over (query, candidate).
composes-with Query Rewriting ★★: Use an LLM to generate several alternative formulations of the user's query, retrieve documents for each, and rank-fuse the results so recall does not depend on one phrasing.
composes-with Hierarchical Retrieval ★★: Route a query through a multi-level cascade (coarse source or index selection, then narrower per-source retrieval, then chunk-level retrieval) so each retrieval decision is pushed to the cheapest tier that can answer it.
complements Behavior-Space Architecture ★: Treat a deployed agent as a space of behaviors over a pool of subsystems and let a router pick, per query, the minimal disjoint subset that query needs, so the effective architecture emerges per query.