PageIndex

also known as PageIndex (VectifyAI)

Type: full-code · Vendor: VectifyAI · Language: Python · License: MIT · Status: active · Status in practice: emerging · First released: 2025

Links: homepage docs repo

Retrieve over long, structured documents without a vector database by building a table-of-contents tree from each document and having an LLM reason down that tree to the relevant sections, returning page and section references instead of vector hits.

Description. PageIndex is VectifyAI's MIT-licensed open-source retrieval engine for long professional documents (financial filings, contracts, regulatory manuals, technical specifications). It is deliberately vectorless: at index time it parses a document into a hierarchical table-of-contents tree with a summary at each node, keeps leaf text intact rather than splitting it into fixed-size chunks, computes no embeddings, and builds no vector store. At query time it presents the tree to an LLM, which performs a tree search: it judges which branch is most likely to hold the answer, descends, and repeats until it reaches leaf sections. Those sections are returned together with their page and section identifiers, so every retrieval is traceable to a named location in the source. The project ships a Python library (entry script run_pageindex.py), an MCP server that exposes the tree navigator to MCP clients, and an example built on the OpenAI Agents SDK that swaps a vector-search tool for the tree-navigation tool. It positions itself as an alternative to embedding-based retrieval-augmented generation for documents where vocabulary overlap misleads similarity search and where auditable, citation-grounded retrieval matters.
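To make the index shape concrete, here is a minimal sketch of a table-of-contents tree built from a flat list of headings. It is not PageIndex's actual code or data model: the names TocNode and build_tree, the (level, title, start_page) input format, and the sample headings are all assumptions made for illustration, and the per-node summaries that PageIndex generates are left blank here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TocNode:
    """One section of the document (hypothetical structure, not PageIndex's)."""
    title: str
    start_page: int
    end_page: int
    summary: str = ""            # would be filled in by an LLM at index time
    text: Optional[str] = None   # leaf text kept whole, never chunked
    children: List["TocNode"] = field(default_factory=list)


def build_tree(headings: List[Tuple[int, str, int]], last_page: int) -> TocNode:
    """Nest (level, title, start_page) headings, given in document order."""
    root = TocNode("ROOT", 1, last_page)
    stack = [(0, root)]  # (heading level, node) from the root to the current section
    for level, title, page in headings:
        node = TocNode(title, page, last_page)
        # Pop back up until the top of the stack is a shallower heading.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1].children.append(node)
        stack.append((level, node))
    _close_page_ranges(root, last_page)
    return root


def _close_page_ranges(node: TocNode, end_page: int) -> None:
    """A section ends where its next sibling starts, or where its parent ends."""
    node.end_page = end_page
    for i, child in enumerate(node.children):
        if i + 1 < len(node.children):
            child_end = max(child.start_page, node.children[i + 1].start_page - 1)
        else:
            child_end = end_page
        _close_page_ranges(child, child_end)


# Made-up sample outline for a 30-page filing.
root = build_tree(
    [
        (1, "Item 1. Business", 3),
        (2, "Overview", 3),
        (2, "Competition", 7),
        (1, "Item 1A. Risk Factors", 12),
    ],
    last_page=30,
)
```

Because each node carries a title and a page range, anything retrieved from this tree can be reported as a named location rather than as an anonymous chunk.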

Agent loop shape. Index time: parse document -> table-of-contents tree with per-node summaries (no embeddings, no chunking). Query time: the LLM reasons over the tree, chooses a branch, descends, and repeats until it reaches the relevant leaf sections; retrieval returns those sections with page and section references for the generator to answer over.
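The query-time loop can be sketched as a recursive descent in which a pluggable chooser decides which children to follow. This is an illustrative sketch under stated assumptions, not PageIndex's implementation: Node, tree_search, and keyword_chooser are hypothetical names, the sample tree and page numbers are invented, and keyword_chooser is only a deterministic stand-in for the LLM call so that the example runs offline. A real system would instead prompt an LLM with the query and the children's summaries.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class Node:
    title: str
    summary: str
    pages: Tuple[int, int]
    children: List["Node"] = field(default_factory=list)


# A chooser takes the query and a node's children and returns the children to descend into.
Chooser = Callable[[str, List[Node]], List[Node]]


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_chooser(query: str, children: List[Node]) -> List[Node]:
    """Stand-in for the LLM: keep the children whose title and summary share most words with the query."""
    q = _tokens(query)
    scores = [len(q & _tokens(c.title + " " + c.summary)) for c in children]
    best = max(scores, default=0)
    if best == 0:
        return []  # nothing looks relevant, so stop descending here
    return [c for c, s in zip(children, scores) if s == best]


def tree_search(node: Node, query: str, choose: Chooser) -> List[Node]:
    """Choose a branch, descend, and repeat until leaf sections are reached."""
    if not node.children:
        return [node]
    hits: List[Node] = []
    for child in choose(query, node.children):
        hits.extend(tree_search(child, query, choose))
    return hits


# Invented sample tree; titles, summaries, and page numbers are illustrative only.
root = Node("ROOT", "", (1, 60), [
    Node("Financial Statements", "balance sheet, income statement, cash flow, revenue", (45, 55), [
        Node("Income Statement", "revenue, cost of sales, net income", (45, 47)),
        Node("Balance Sheet", "assets liabilities equity", (48, 50)),
    ]),
    Node("Risk Factors", "competition, regulation, litigation risk", (12, 30)),
])

hits = tree_search(root, "net income and revenue", keyword_chooser)
for h in hits:
    print(h.title, h.pages)  # prints: Income Statement (45, 47)
```

Keeping the chooser as a parameter means the navigation logic stays the same whether the branch decision comes from a keyword heuristic, as here, or from an LLM reasoning over node summaries.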

Primary use cases

retrieval and question answering over long structured documents (financial filings, contracts, manuals)
auditable retrieval where every result must point to a named page and section
agentic document analysis where a tree-navigation tool replaces a vector-search tool

flowchart TD
    DOC["Long structured document"] --> IDX["Index: parse into table-of-contents tree with node summaries"]
    IDX --> TREE["Tree index - no vectors, no chunks"]
    Q["Query"] --> NAV{"LLM tree search over node summaries"}
    TREE --> NAV
    NAV -->|descend branch| NAV
    NAV -->|leaves reached| SEC["Selected leaf sections + page/section refs"]
    SEC --> GEN["Generator answers with citations"]

Key concepts

Vectorless retrieval → vectorless-reasoning-retrieval — No embeddings and no vector store are built; the document structure is the index.
Table-of-contents tree → vectorless-reasoning-retrieval — The document's own section hierarchy, with a summary per node, is the index that is navigated.
LLM tree search → vectorless-reasoning-retrieval — Reasoning, not similarity, decides each navigation step down the tree to the relevant sections.
Page and section references → citation-attribution — Retrieval results are traceable named locations rather than opaque vector hits, feeding citations directly.
Tree navigator as a tool → agentic-rag — An MCP server and an OpenAI Agents SDK example expose the tree navigator as a retrieval tool inside an agent loop.
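As a rough illustration of how page and section references might feed citations when the navigator is wrapped as an agent tool, the sketch below formats retrieved sections into a JSON payload. The function names, the dictionary keys (path, start_page, end_page, text), and the sample section are all hypothetical. Nothing here reflects the actual output schema of PageIndex, its MCP server, or the OpenAI Agents SDK example.

```python
import json
from typing import Dict, List


def format_reference(path: List[str], start_page: int, end_page: int) -> str:
    """Render a section path and page range as a human-readable citation."""
    pages = f"p. {start_page}" if start_page == end_page else f"pp. {start_page}-{end_page}"
    return f"{' > '.join(path)} ({pages})"


def build_tool_result(sections: List[Dict]) -> str:
    """Serialize retrieved sections so a generator can quote text alongside its reference."""
    results = [
        {
            "reference": format_reference(s["path"], s["start_page"], s["end_page"]),
            "text": s["text"],
        }
        for s in sections
    ]
    return json.dumps({"results": results}, indent=2)


# Invented example section.
payload = build_tool_result([
    {
        "path": ["Item 7", "Liquidity"],
        "start_page": 52,
        "end_page": 54,
        "text": "Cash and cash equivalents were ...",
    }
])
print(payload)
```

Returning the reference next to the text lets the generator cite a named location directly, without looking the position up afterwards.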