IV · Retrieval & RAG · Mature ★★

Cross-Encoder Reranking

also known as Reranker, Two-Stage Retrieval, Retrieve-Then-Rerank

After cheap bi-encoder or BM25 retrieval, rescore top-N candidates with a cross-encoder that jointly attends over (query, candidate).

Context

A team is using a two-stage retrieval pipeline. The first stage is a fast bi-encoder that embeds the query and each document independently and compares their vectors; an approximate nearest-neighbour index returns a top-N candidate set from a large corpus. Because the encoder sees query and document separately, it cannot model fine-grained interactions between them, and because the index is tuned for recall, the top-N list mixes truly relevant candidates with topically similar but unhelpful ones.

Problem

Feeding the entire top-N list into the downstream generator wastes its context window on irrelevant candidates and lets the loudest distractor mislead the answer. The team needs a way to re-order or filter the candidate set so that the most relevant items rise to the top, but they cannot afford to run a heavy joint scoring model over the whole corpus on every query. They need a small but expensive scorer that runs only over the cheap retriever's shortlist and re-sorts it by genuine query-document relevance.

Forces

  • Cross-encoder cost is one model call per (query, candidate) pair, and because the score depends on the query it cannot be precomputed at index time.
  • Latency budget caps N (typically 20-100).
  • Fine-tuning a custom reranker is a separate effort.
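
The first two forces can be turned into back-of-envelope arithmetic. The sketch below is illustrative only: the function name, the assumption that candidates are scored in sequential batches, and every number in the usage line are hypothetical, not measurements from any particular reranker.

```python
import math


def max_rerank_candidates(budget_ms: float, first_stage_ms: float,
                          batch_ms: float, batch_size: int) -> int:
    """Largest N whose reranking fits in what is left of the latency budget.

    Assumes candidates are scored in sequential batches of `batch_size`,
    each taking roughly `batch_ms`. All inputs are hypothetical estimates.
    """
    remaining = budget_ms - first_stage_ms
    if remaining <= 0 or batch_ms <= 0 or batch_size <= 0:
        return 0
    full_batches = math.floor(remaining / batch_ms)
    return full_batches * batch_size


# Hypothetical figures: 300 ms budget, 40 ms first-stage retrieval,
# 60 ms per batch of 16 pairs. 260 ms remain, so 4 full batches fit.
print(max_rerank_candidates(300, 40, 60, 16))  # prints 64
```

Under these made-up figures N lands inside the 20-100 range quoted above; a slower model or a tighter budget pushes it down quickly.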

Example

A legal-research agent retrieves 100 candidate paragraphs from a corpus of contracts that mention 'force majeure'. Many are off-topic. Before showing them to the LLM, a small cross-encoder model scores each candidate against the user's exact question, picks the top 5, and discards the rest. The LLM only ever reads the sharpest results.

Solution

Therefore:

Two-stage retrieval. Stage 1: cheap retrieve (BM25, dense, hybrid) returns top-N. Stage 2: cross-encoder scores each (query, candidate) jointly. Return top-K << N to the generator.

What this pattern forbids. The generator sees only the reranker's top-K; pre-rerank candidates are not used.
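
A minimal sketch of stage 2, assuming the shortlist from stage 1 is already in hand. The names `rerank`, `PairScorer`, and `toy_overlap_scorer` are hypothetical, and the toy scorer is a token-overlap stand-in so the example runs without a model; it is not a cross-encoder. In practice `score_pairs` would wrap a real cross-encoder that takes a list of (query, candidate) pairs and returns one score per pair, for example the `predict` method of sentence-transformers' `CrossEncoder`.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer maps (query, candidate) pairs to one relevance score per pair.
PairScorer = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]


def rerank(query: str, candidates: Sequence[str],
           score_pairs: PairScorer, top_k: int) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and keep only the top_k.

    Candidates outside the top_k are dropped, so the generator never
    sees the pre-rerank list.
    """
    if not candidates or top_k <= 0:
        return []
    pairs = [(query, c) for c in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    return [(c, float(s)) for c, s in ranked[:top_k]]


def toy_overlap_scorer(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer (NOT a cross-encoder): share of query tokens in the candidate."""
    scores = []
    for query, cand in pairs:
        q = set(query.lower().split())
        c = set(cand.lower().split())
        scores.append(len(q & c) / len(q) if q else 0.0)
    return scores


# Hypothetical shortlist, as if returned by the cheap first stage.
shortlist = [
    "Payment is due within thirty days.",
    "Force majeure notices must be sent in writing.",
    "Force majeure shall cover pandemics and epidemics.",
]
for text, score in rerank("does force majeure cover pandemics",
                          shortlist, toy_overlap_scorer, top_k=2):
    print(f"{score:.2f}  {text}")
```

Keeping the scorer as a plain callable means the first-stage retriever and the reranking model can each be swapped without touching the other.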

And the patterns that stand alongside it, or against it —

  • composes-with Naive RAG ★★: Condition the generator on top-k chunks retrieved from an external dense index so knowledge lives outside parameters.
  • composes-with Hybrid Search ★★: Combine sparse lexical retrieval (BM25) with dense vector retrieval and fuse the results.
  • composes-with Agentic RAG ★★: Replace static retrieve-then-generate with autonomous agents that plan, choose sources, retrieve iteratively, reflect, and re-query.
  • composes-with Contextual Retrieval: Prepend a short LLM-generated description to each chunk before embedding so the chunk carries its situating context.
  • composes-with HyDE: Have the LLM write a hypothetical answer document, embed it, and use it as the retrieval query.
  • composes-with Query Rewriting ★★: Use an LLM to generate several alternative formulations of the user's query, retrieve documents for each, and rank-fuse the results so recall does not depend on one phrasing.
  • composes-with HippoRAG: Build an LLM-extracted schemaless knowledge graph from the corpus and run Personalized PageRank seeded on the query's key concepts so multi-hop retrieval completes in a single pass.
  • composes-with Modular RAG: Decompose RAG into a typed three-layer hierarchy of Module Types, Modules, and Operators so the pipeline (routing, scheduling, fusion, retrieval, post-retrieval, generation) can be rearranged per query rather than running a fixed linear retrieve-then-generate.
  • composes-with Hierarchical Retrieval ★★: Route a query through a multi-level cascade (coarse source or index selection, then per-source narrower retrieval, then chunk-level) so each retrieval decision is pushed to the cheapest tier that can answer it.
