Query Rewriting

also known as Multi-Query Retrieval, Query Expansion, Query Reformulation, RAG-Fusion (query side)

Use an LLM to generate several alternative formulations of the user's query, retrieve documents for each, and rank-fuse the results so recall does not depend on one phrasing.

This pattern helps complete certain larger patterns —

specialises Naive RAG ★★ — Condition the generator on top-k chunks retrieved from an external dense index so knowledge lives outside parameters.

Context

A team runs retrieval over a corpus where the user's natural phrasing is only one of many ways to express the same information need. The corpus chunks may use different vocabulary, abbreviations, or framing for the same concept, and an embedding-based lookup against a single query vector lands in only one neighbourhood of the embedding space. Users themselves under-specify, ask compound questions, or use idioms the corpus does not echo.

Problem

A single query embedding samples only one point in the semantic space and retrieves only the chunks closest to that point. Relevant chunks expressed in different vocabulary, at a different specificity level, or framed as a different sub-question are missed entirely, and no downstream reranker can rescue a chunk that was never retrieved. The user's first phrasing is a noisy estimator of intent, and recall is bottlenecked by how well that one phrasing aligns with how the answer chunks were written.

Forces

More query variants improve recall, but retrieval cost grows linearly with the number of variants.
Variants generated by the LLM may drift off-topic and inject noise into the result set.
Fusion strategy (union, RRF, weighted) decides whether rare-but-relevant chunks survive deduplication.
Latency budget bounds how many parallel retrievals the system can afford per request.
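
The fusion force can be made concrete with a small comparison. The sketch below is illustrative only: the chunk ids and rankings are made up, `rrf` uses the commonly cited smoothing constant of 60, and `round_robin_union` is one hypothetical way to build a deduplicated union rather than a standard algorithm.

```python
from collections import defaultdict
from itertools import zip_longest


def rrf(rankings, k_rrf=60, weights=None):
    """Reciprocal Rank Fusion: each list adds weight / (k_rrf + rank) per chunk."""
    weights = weights or [1.0] * len(rankings)
    scores = defaultdict(float)
    for weight, ranking in zip(weights, rankings):
        for rank, chunk in enumerate(ranking, start=1):
            scores[chunk] += weight / (k_rrf + rank)
    # Sort by descending score; break ties by chunk id for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


def round_robin_union(rankings):
    """Deduplicated union that interleaves the lists rank by rank."""
    seen, fused = set(), []
    for tier in zip_longest(*rankings):
        for chunk in tier:
            if chunk is not None and chunk not in seen:
                seen.add(chunk)
                fused.append(chunk)
    return fused


# Toy rankings (assumed data): "rare" is top-ranked, but only for one variant.
rankings = [
    ["a", "b", "c"],     # original query
    ["b", "a", "d"],     # variant 1
    ["rare", "c", "a"],  # variant 2
]

top3_rrf = rrf(rankings)[:3]
top3_union = round_robin_union(rankings)[:3]
top3_weighted = rrf(rankings, weights=[1.0, 1.0, 3.0])[:3]
```

With these toy rankings, plain RRF places `rare` fourth, so a top-3 cut drops it, while the interleaved union and the weighted variant both keep it in the top 3. Which behaviour is preferable depends on how much the system trusts agreement between variants over a single strong hit.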

Example

A documentation assistant gets the query 'why is my deploy slow?' A single embedding lookup retrieves a handful of chunks about deploy speed in general. The rewriter produces variants — 'deployment latency causes', 'slow CI pipeline diagnosis', 'docker image build time bottlenecks', 'kubernetes rollout delay reasons' — and runs retrieval for each in parallel. The fused result set now contains chunks about image-build caching and rollout strategy that the original query embedding sat too far from. Reciprocal Rank Fusion across the five rankings surfaces the chunks that appear in multiple lists.
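
The rewriter step in this example could be sketched as follows. The prompt wording, the helper names `build_rewrite_prompt` and `parse_variants`, and the one-rewrite-per-line output format are all assumptions made for illustration. The LLM call itself is left out, and a canned response stands in for it.

```python
import re


def build_rewrite_prompt(query: str, n: int = 4) -> str:
    """Hypothetical prompt asking for n reformulations, one per line."""
    return (
        f"Rewrite the search query below in {n} different ways.\n"
        "Mix paraphrases, narrower sub-questions, and more specific wordings.\n"
        "Return one rewrite per line with no commentary.\n\n"
        f"Query: {query}"
    )


def parse_variants(original: str, llm_output: str, n: int = 4) -> list[str]:
    """Turn raw LLM text into a clean variant list with the original first."""
    variants = [original]
    seen = {original.strip().lower()}
    for line in llm_output.splitlines():
        # Strip list markers such as "1.", "2)", "-", or "*", then stray quotes.
        text = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip().strip("'\"")
        key = text.lower()
        if text and key not in seen:  # skip blanks and case-insensitive repeats
            seen.add(key)
            variants.append(text)
    return variants[: n + 1]  # the original plus at most n rewrites


# Canned response standing in for a real LLM call (assumed output format).
fake_llm_output = """\
1. deployment latency causes
2. slow CI pipeline diagnosis
- docker image build time bottlenecks
* kubernetes rollout delay reasons
Why is my deploy slow?
"""

variants = parse_variants("why is my deploy slow?", fake_llm_output)
```

Keeping the original query at index 0 and dropping echoes of it means the retriever always sees the user's own phrasing exactly once, whatever the model returns.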

Diagram

flowchart TD
    Q[Original query] --> R[LLM rewriter]
    R --> V1[Variant 1]
    R --> V2[Variant 2]
    R --> V3[Variant N]
    Q --> V0[Original kept as variant 0]
    V0 --> Ret[Retriever]
    V1 --> Ret
    V2 --> Ret
    V3 --> Ret
    Ret --> F[Rank fusion / RRF]
    F --> Out[Top-N to generator or reranker]

Solution

Therefore:

At query time, prompt an LLM to produce N reformulations of the user's query (typically 3–5) covering paraphrase, decomposition into sub-questions, and specificity shifts. Retrieve top-k chunks for each variant in parallel. Fuse the result lists with Reciprocal Rank Fusion or a deduplicated union, then pass the fused top-N forward to the generator or to a downstream reranker. The original query is included as one of the variants, so every chunk the single-query baseline would retrieve stays in the candidate pool.
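
A minimal end-to-end sketch of these steps is shown below. Everything concrete in it is an assumption for illustration: `toy_rewrite` returns canned variants in place of an LLM, `toy_retrieve` is a word-overlap lookup over a four-chunk made-up corpus standing in for a dense index, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def rrf_fuse(rankings, k_rrf=60):
    """Reciprocal Rank Fusion over ranked lists of chunk ids."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k_rrf + rank)
    return sorted(scores, key=lambda c: (-scores[c], c))


def multi_query_retrieve(query, rewrite, retrieve, top_k=5, top_n=5):
    """Rewrite, retrieve per variant in parallel, then fuse."""
    # The original query is always variant 0.
    variants = [query] + [v for v in rewrite(query) if v != query]
    with ThreadPoolExecutor(max_workers=len(variants)) as pool:
        rankings = list(pool.map(lambda v: retrieve(v, top_k), variants))
    return rrf_fuse(rankings)[:top_n]


# --- Toy stand-ins (assumed data, not a real index or model) ---
CORPUS = {
    "deploy-overview": "deploy speed overview slow deploy",
    "image-cache": "docker image build cache layers",
    "rollout": "kubernetes rollout strategy delay",
    "ci-pipeline": "ci pipeline diagnosis stages",
}


def toy_retrieve(query, k):
    """Rank chunks by the number of words shared with the query."""
    terms = set(re.findall(r"\w+", query.lower()))
    hits = []
    for chunk_id, text in CORPUS.items():
        overlap = len(terms & set(text.split()))
        if overlap:
            hits.append((-overlap, chunk_id))
    return [chunk_id for _, chunk_id in sorted(hits)][:k]


def toy_rewrite(query):
    """Canned variants in place of an LLM call."""
    return [
        "deployment latency causes",
        "slow ci pipeline diagnosis",
        "docker image build time bottlenecks",
        "kubernetes rollout delay reasons",
    ]


query = "why is my deploy slow?"
baseline = toy_retrieve(query, 5)
fused = multi_query_retrieve(query, toy_rewrite, toy_retrieve)
```

On this toy corpus the single-query baseline finds only `deploy-overview`, while the fused list also contains `image-cache`, `rollout`, and `ci-pipeline`. A thread pool is used here because retrieval calls are typically I/O-bound; an async client would serve the same purpose.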

What this pattern forbids. The retriever cannot be driven by the user's original query alone; the result set is the rank-fusion across all generated variants plus the original.

And the patterns that stand alongside it, or against it —

composes-with Hybrid Search ★★ — Combine sparse lexical retrieval (BM25) with dense vector retrieval and fuse the results.
composes-with Cross-Encoder Reranking ★★ — After cheap bi-encoder or BM25 retrieval, rescore top-N candidates with a cross-encoder that jointly attends over (query, candidate).
alternative-to HyDE ★ — Have the LLM write a hypothetical answer document, embed it, and use it as the retrieval query.
composes-with Modular RAG ★ — Decompose RAG into a typed three-layer hierarchy of Module Types, Modules, and Operators so the pipeline (routing, scheduling, fusion, retrieval, post-retrieval, generation) can be rearranged per query rather than running a fixed linear retrieve-then-generate.
composes-with Hierarchical Retrieval ★★ — Route a query through a multi-level cascade — coarse source or index selection, then per-source narrower retrieval, then chunk-level — so each retrieval decision is pushed to the cheapest tier that can answer it.