Full-Code · Research Agents · active

PageIndex

also known as PageIndex (VectifyAI)

Type: full-code · Vendor: VectifyAI · Language: Python · License: MIT · Status: active · Status in practice: emerging · First released: 2025

Retrieve over long, structured documents without a vector database by building a table-of-contents tree from each document and having an LLM reason down that tree to the relevant sections, returning page and section references instead of vector hits.

Description. PageIndex is VectifyAI's MIT-licensed open-source retrieval engine for long professional documents (financial filings, contracts, regulatory manuals, technical specifications). It is deliberately vectorless: at index time it parses a document into a hierarchical table-of-contents tree with a summary at each node and keeps leaf text intact rather than splitting it into fixed-size chunks; it computes no embeddings and builds no vector store. At query time it presents the tree to an LLM, which performs a tree search: it judges which branch is most likely to hold the answer, descends, and repeats. The search returns the leaf sections it reaches together with their page and section identifiers, so every retrieval is traceable to a named location in the source. The project ships a Python library (invoked through run_pageindex.py), an MCP server that exposes the tree navigator to MCP clients, and an example built on the OpenAI Agents SDK that swaps a vector-search tool for the tree-navigation tool. It positions itself as an alternative to embedding-based retrieval-augmented generation for documents where vocabulary overlap misleads similarity search and where auditable, citation-grounded retrieval matters.

Agent loop shape. Index time: parse document -> table-of-contents tree with per-node summaries (no embeddings, no chunking). Query time: the LLM reasons over the tree, chooses a branch, descends, and repeats until it reaches the relevant leaf sections; retrieval returns those sections with page and section references for the generator to answer over.
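
The loop above can be pictured with a minimal sketch. It is not PageIndex's actual API: the `Node` class, the `tree_search` function, and the `keyword_chooser` helper are hypothetical names introduced here for illustration. In particular, `keyword_chooser` is only an offline stand-in for the LLM call that would judge which branch to descend into, and the sketch follows a single branch to one leaf, whereas the real system may return several sections.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    """One table-of-contents entry (hypothetical structure, for illustration)."""
    title: str
    summary: str
    start_page: int
    end_page: int
    text: str = ""  # only leaves carry section text; it is kept whole, not chunked
    children: List["Node"] = field(default_factory=list)


# A chooser receives the query and the candidate children and returns the
# index of the child to descend into. In a real system this is an LLM call.
Chooser = Callable[[str, List[Node]], int]


def keyword_chooser(query: str, candidates: List[Node]) -> int:
    """Offline stand-in for the LLM: pick the child sharing the most words."""
    query_words = set(query.lower().split())
    scores = [
        len(query_words & set(f"{c.title} {c.summary}".lower().split()))
        for c in candidates
    ]
    return scores.index(max(scores))


def tree_search(root: Node, query: str, choose: Chooser) -> Tuple[Node, List[str]]:
    """Descend from the root, one chosen branch per level, until a leaf."""
    node, path = root, [root.title]
    while node.children:
        node = node.children[choose(query, node.children)]
        path.append(node.title)
    return node, path


# Toy document tree with per-node summaries and page ranges.
root = Node("Annual Report", "full filing", 1, 120, children=[
    Node("Business Overview", "company products markets", 1, 20,
         text="The company sells ..."),
    Node("Financial Statements", "balance sheet income cash flow", 21, 90, children=[
        Node("Income Statement", "revenue expenses net income", 21, 40,
             text="Net income for the year ..."),
        Node("Balance Sheet", "assets liabilities equity", 41, 60,
             text="Total assets ..."),
    ]),
])

leaf, path = tree_search(root, "net income and revenue for the year", keyword_chooser)
print(" > ".join(path), f"(pages {leaf.start_page}-{leaf.end_page})")
```

Keeping the chooser as a plain callable means the navigation logic stays the same whether the decision comes from a keyword heuristic, as in this toy, or from a model prompted with the node titles and summaries.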

Primary use cases

  • retrieval and question answering over long structured documents (financial filings, contracts, manuals)
  • auditable retrieval where every result must point to a named page and section
  • agentic document analysis where a tree-navigation tool replaces a vector-search tool

Key concepts

  • Vectorless retrieval (vectorless-reasoning-retrieval): No embeddings and no vector store are built; the document structure is the index.
  • Table-of-contents tree (vectorless-reasoning-retrieval): The document's own section hierarchy, with a summary per node, is the index that is navigated.
  • LLM tree search (vectorless-reasoning-retrieval): Reasoning, not similarity, decides each navigation step down the tree to the relevant sections.
  • Page and section references (citation-attribution): Retrieval results are traceable named locations rather than opaque vector hits, feeding citations directly.
  • Tree navigator as a tool (agentic-rag): An MCP server and an OpenAI Agents SDK example expose the tree navigator as a retrieval tool inside an agent loop.
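
To make the last two concepts concrete, the sketch below shows one way a retrieval result could carry its page and section reference and be serialized for a tool response. It is an illustration under stated assumptions, not PageIndex's output format: `SectionRef`, `cite`, and `tool_result` are hypothetical names, and the JSON shape is invented for this example rather than taken from the MCP server or the OpenAI Agents SDK example.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class SectionRef:
    """A named location in the source document (hypothetical record shape)."""
    section_id: str  # e.g. "2.1"; the identifier scheme is an assumption
    title: str
    start_page: int
    end_page: int

    def cite(self) -> str:
        """Render a human-readable citation for the generator to quote."""
        if self.start_page == self.end_page:
            pages = f"p. {self.start_page}"
        else:
            pages = f"pp. {self.start_page}-{self.end_page}"
        return f"[§{self.section_id} {self.title}, {pages}]"


def tool_result(refs: List[SectionRef], texts: List[str]) -> str:
    """Build a JSON payload that a retrieval tool could hand back to an agent."""
    return json.dumps([
        {**asdict(ref), "citation": ref.cite(), "text": text}
        for ref, text in zip(refs, texts)
    ])


ref = SectionRef("2.1", "Income Statement", 21, 40)
payload = tool_result([ref], ["Net income for the year ..."])
print(ref.cite())
```

Because each result names a section and a page range, the answering model can attach the citation string directly to any claim it draws from that section, which is what makes the retrieval auditable.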

Patterns this full-code implements —

References

Provenance

  • Last analyzed:
  • Last updated:
  • Verification status: partial