Framework · Research Agents

PageIndex

Retrieve over long, structured documents without a vector database by building a table-of-contents tree from each document and having an LLM reason down that tree to the relevant sections, returning page and section references instead of vector hits.

Description

PageIndex is VectifyAI's MIT-licensed open-source retrieval engine for long professional documents (financial filings, contracts, regulatory manuals, technical specifications). It is deliberately vectorless: at index time it parses a document into a hierarchical table-of-contents tree with a summary at each node and keeps leaf text intact rather than splitting it into fixed-size chunks; it computes no embeddings and builds no vector store. At query time it presents the tree to an LLM, which performs a tree search: it judges which branch is most likely to hold the answer, descends, and repeats until it reaches leaf sections. It returns those sections together with their page and section identifiers, so every retrieval is traceable to a named location in the source. The project ships a Python library (run_pageindex.py), an MCP server that exposes the tree navigator to MCP clients, and an example built on the OpenAI Agents SDK that swaps a vector-search tool for the tree-navigation tool. It positions itself as an alternative to embedding-based retrieval-augmented generation for documents where vocabulary overlap misleads similarity search and where auditable, citation-grounded retrieval matters.
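To make the index-time structure concrete, here is a minimal sketch of a table-of-contents tree with a summary and page range at each node. It is not PageIndex's actual data model or API: `TocNode`, `build_toc_tree`, and the `(level, title, start_page, end_page, text)` input tuples are hypothetical names chosen for illustration, and the "summary" is simply the first sentence of the section text, standing in for the LLM-written summary the description mentions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TocNode:
    """One node of a hypothetical table-of-contents tree."""
    title: str
    start_page: int
    end_page: int
    summary: str = ""
    text: str = ""  # section text is kept whole, never split into fixed-size chunks
    children: list[TocNode] = field(default_factory=list)


def build_toc_tree(sections):
    """Nest flat (level, title, start_page, end_page, text) tuples by heading level.

    Assumes sections are in document order and levels start at 1.
    """
    root = TocNode(
        "ROOT",
        min(s[2] for s in sections),
        max(s[3] for s in sections),
    )
    stack = [(0, root)]  # (heading level, node); the root sits at level 0
    for level, title, start, end, text in sections:
        # Placeholder summary: first sentence only (an LLM would write this).
        node = TocNode(title, start, end, summary=text.split(".")[0], text=text)
        # Pop back up until the top of the stack is a shallower heading.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1].children.append(node)
        stack.append((level, node))
    return root


sections = [
    (1, "1 Risk factors", 10, 24, "Overview of principal risks. Details follow."),
    (2, "1.1 Market risk", 10, 15, "Interest rate and currency exposure. Details follow."),
    (2, "1.2 Credit risk", 16, 24, "Counterparty default exposure. Details follow."),
    (1, "2 Financial statements", 25, 60, "Audited statements. Details follow."),
]
tree = build_toc_tree(sections)
```

Because each node carries its own page range, anything retrieved from the tree can be cited by section title and pages without a separate lookup.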

Solution

Index time: parse the document into a table-of-contents tree with per-node summaries (no embeddings, no fixed-size chunking). Query time: the LLM reasons over the tree, chooses a branch, descends, and repeats until it reaches the relevant leaf sections; retrieval returns those sections with page and section references for the generator to answer over.
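The query-time loop can be sketched as a greedy descent in which a pluggable "chooser" picks one child at each level. This is an illustration under stated assumptions, not PageIndex's implementation: the dict layout (`title`, `summary`, `pages`, `text`, `nodes`), `tree_search`, and `keyword_chooser` are all hypothetical. In a real system the chooser would be an LLM call that reads the child titles and summaries; the word-overlap stand-in below exists only so the example runs offline, and it follows a single branch rather than returning several leaves.

```python
from __future__ import annotations

import re


def keyword_chooser(query: str, children: list[dict]) -> int:
    """Offline stand-in for the LLM's branch judgment (word overlap only)."""
    q = set(re.findall(r"\w+", query.lower()))
    scores = [
        len(q & set(re.findall(r"\w+", (c["title"] + " " + c["summary"]).lower())))
        for c in children
    ]
    return scores.index(max(scores))  # first child with the highest score


def tree_search(root: dict, query: str, choose=keyword_chooser) -> dict:
    """Descend from the root, one chosen branch per level, until a leaf is reached."""
    node, path = root, []
    while node.get("nodes"):
        node = node["nodes"][choose(query, node["nodes"])]
        path.append(node["title"])
    # Return the leaf text together with a traceable section path and page range.
    return {"section": " > ".join(path), "pages": node["pages"], "text": node["text"]}


doc_tree = {
    "title": "Annual report", "summary": "", "nodes": [
        {"title": "Risk factors", "summary": "market and credit risk exposure",
         "pages": (10, 24), "nodes": [
             {"title": "Market risk", "summary": "interest rate and currency risk",
              "pages": (10, 15), "text": "Market risk section text.", "nodes": []},
             {"title": "Credit risk", "summary": "counterparty default exposure",
              "pages": (16, 24), "text": "Credit risk section text.", "nodes": []},
         ]},
        {"title": "Financial statements", "summary": "balance sheet income statement cash flow",
         "pages": (25, 60), "nodes": [
             {"title": "Balance sheet", "summary": "assets and liabilities",
              "pages": (25, 40), "text": "Balance sheet section text.", "nodes": []},
         ]},
    ],
}

result = tree_search(doc_tree, "What is the counterparty credit risk exposure?")
```

Keeping the chooser as a parameter means the same descent loop works whether the branch decision comes from an LLM prompt, a rule, or a test stub.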

Primary use cases

  • retrieval and question answering over long structured documents (financial filings, contracts, manuals)
  • auditable retrieval where every result must point to a named page and section
  • agentic document analysis where a tree-navigation tool replaces a vector-search tool
