PageIndex
also known as PageIndex (VectifyAI)
Type: full-code · Vendor: VectifyAI · Language: Python · License: MIT · Status: active · Status in practice: emerging · First released: 2025
Retrieve over long, structured documents without a vector database by building a table-of-contents tree from each document and having an LLM reason down that tree to the relevant sections, returning page and section references instead of vector hits.
Description. PageIndex is VectifyAI's MIT-licensed open-source retrieval engine for long professional documents (financial filings, contracts, regulatory manuals, technical specifications). It is deliberately vectorless: at index time it parses a document into a hierarchical table-of-contents tree with a summary at each node and keeps leaf text intact rather than splitting it into fixed-size chunks; it computes no embeddings and builds no vector store. At query time it presents the tree to an LLM, which performs a tree search: it judges which branch is most likely to hold the answer, descends, and repeats. The search returns the leaf sections it reaches together with their page and section identifiers, so every retrieval is traceable to a named location in the source. The project ships a Python library (entry script run_pageindex.py), an MCP server that exposes the tree navigator to MCP clients, and an example built on the OpenAI Agents SDK that swaps a vector-search tool for the tree-navigation tool. It positions itself as an alternative to embedding-based retrieval-augmented generation for documents where vocabulary overlap misleads similarity search (many sections share the same terms) and where auditable, citation-grounded retrieval matters.
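The index-time step described above can be sketched as nesting a flat outline into a tree. This is a minimal illustration, not PageIndex's actual code or API: the Node fields, the build_tree helper, and the (level, title, start_page, end_page) outline format are all hypothetical, and summaries are left empty where a real indexer would have an LLM write them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One table-of-contents entry. All field names are hypothetical."""
    title: str
    start_page: int
    end_page: int
    summary: str = ""   # would be written by an LLM at index time
    text: str = ""      # leaf text is kept whole, not split into chunks
    children: list["Node"] = field(default_factory=list)


def build_tree(outline: list[tuple[int, str, int, int]], doc_title: str) -> Node:
    """Nest (level, title, start_page, end_page) rows into a tree.

    Rows must be in document order and heading levels must start at 1.
    """
    last_page = max((end for _, _, _, end in outline), default=1)
    root = Node(doc_title, 1, last_page)
    stack = [(0, root)]  # (level, node) path from the root to the latest entry
    for level, title, start, end in outline:
        node = Node(title, start, end)
        # Pop until the top of the stack is a shallower heading (the parent).
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1].children.append(node)
        stack.append((level, node))
    return root


# Illustrative outline for an imaginary filing.
outline = [
    (1, "Item 1. Business", 3, 10),
    (1, "Item 7. Management's Discussion", 11, 40),
    (2, "Results of Operations", 12, 24),
    (2, "Liquidity and Capital Resources", 25, 31),
]
tree = build_tree(outline, "10-K")
```

Note that no embedding or vector store appears anywhere in the sketch: the nested headings, with their page ranges, are the whole index.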
Agent loop shape. Index time: parse the document into a table-of-contents tree with per-node summaries (no embeddings, no fixed-size chunking). Query time: the LLM reasons over the tree, chooses a branch, descends, and repeats until it reaches the relevant leaf sections; retrieval returns those sections with page and section references, and the generator answers over them.
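The query-time loop can be sketched as a descent that asks a chooser which child to follow at each level. This is an illustration under stated assumptions, not PageIndex's implementation: Node, Chooser, and tree_search are hypothetical names, the sketch follows a single branch for brevity, and keyword_chooser is a deterministic placeholder for the LLM call. A real chooser would prompt a model with the question plus the candidate titles and summaries, and would reason rather than count shared words.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    """Hypothetical tree node: a section with its page range and summary."""
    title: str
    pages: tuple[int, int]
    summary: str = ""
    text: str = ""
    children: list["Node"] = field(default_factory=list)


# Hypothetical signature: (question, candidate children) -> index to follow.
Chooser = Callable[[str, list[Node]], int]


def keyword_chooser(question: str, candidates: list[Node]) -> int:
    """Placeholder for the LLM: pick the child sharing the most words."""
    q_words = set(question.lower().split())

    def overlap(node: Node) -> int:
        return len(q_words & set(f"{node.title} {node.summary}".lower().split()))

    return max(range(len(candidates)), key=lambda i: overlap(candidates[i]))


def tree_search(root: Node, question: str, choose: Chooser) -> tuple[Node, list[str]]:
    """Descend one branch at a time until a leaf section is reached."""
    node, path = root, [root.title]
    while node.children:
        node = node.children[choose(question, node.children)]
        path.append(node.title)
    return node, path


root = Node("10-K", (1, 40), children=[
    Node("Business", (3, 10), summary="products markets competition"),
    Node("Management's Discussion", (11, 40),
         summary="liquidity results of operations cash",
         children=[
             Node("Results of Operations", (12, 24), summary="revenue margin"),
             Node("Liquidity and Capital Resources", (25, 31),
                  summary="cash debt liquidity"),
         ]),
])

leaf, path = tree_search(root, "how much cash and liquidity does the company have",
                         keyword_chooser)
```

Because the chooser is passed in as a function, the same loop works with a model-backed chooser; the path it returns is what makes the retrieval traceable to a section and page range.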
Primary use cases
- retrieval and question answering over long structured documents (financial filings, contracts, manuals)
- auditable retrieval where every result must point to a named page and section
- agentic document analysis where a tree-navigation tool replaces a vector-search tool
Key concepts
- Vectorless retrieval → vectorless-reasoning-retrieval — No embeddings and no vector store are built; the document structure is the index.
- Table-of-contents tree → vectorless-reasoning-retrieval — The document's own section hierarchy, with a summary per node, is the index that is navigated.
- LLM tree search → vectorless-reasoning-retrieval — Reasoning, not similarity, decides each navigation step down the tree to the relevant sections.
- Page and section references → citation-attribution — Retrieval results are traceable named locations rather than opaque vector hits, feeding citations directly.
- Tree navigator as a tool → agentic-rag — An MCP server and an OpenAI Agents SDK example expose the tree navigator as a retrieval tool inside an agent loop.
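To make the last two concepts concrete, here is one way a tree-navigation tool might package its results so that each retrieved section carries a checkable citation. The Retrieved record, the cite format, and as_tool_result are assumptions for illustration; they are not taken from PageIndex's MCP server or the OpenAI Agents SDK example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Retrieved:
    """Hypothetical record for one section reached by the tree search."""
    section_path: tuple[str, ...]  # headings from the root down to the leaf
    start_page: int
    end_page: int
    text: str


def cite(hit: Retrieved) -> str:
    """Render a human-checkable reference: section path plus page range."""
    if hit.start_page == hit.end_page:
        pages = f"p. {hit.start_page}"
    else:
        pages = f"pp. {hit.start_page}-{hit.end_page}"
    return f"{' > '.join(hit.section_path)} ({pages})"


def as_tool_result(hits: list[Retrieved]) -> list[dict]:
    """Shape results for an agent tool call: the text travels with its citation."""
    return [{"citation": cite(h), "text": h.text} for h in hits]


hit = Retrieved(
    section_path=("Item 7", "Liquidity and Capital Resources"),
    start_page=25,
    end_page=31,
    text="(section text)",
)
result = as_tool_result([hit])
```

Keeping the citation next to the text in the tool result lets the generating model quote the location directly instead of reconstructing it.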
Patterns this full-code implements
- ·Vectorless Reasoning-Based Retrieval
PageIndex is the reference implementation of the pattern: it builds the table-of-contents tree and has the LLM tree-search it for retrieval, with no embeddings at index time or query time.
- ★★Citation Attribution
Retrieval returns explicit page and section references rather than opaque vector hits, so answers are grounded in named, checkable locations.
- ★★Agentic RAG
Ships an example built on the OpenAI Agents SDK that drives the tree navigator as a retrieval tool inside an agent loop, in place of a vector-search tool.