Pathway

Type: full-code · Vendor: Pathway · Language: Python, Rust · License: BSL-1.1 · Status: active · Status in practice: emerging · First released: 2023

Links: homepage docs repo

Pathway is a Python and Rust framework for building streaming data and RAG pipelines on a differential-dataflow engine; it keeps vector indexes synchronised with their sources as data changes.

Description. Pathway runs the same pipeline code over batch and streaming data, with a Rust engine that recomputes results incrementally as inputs change. Its LLM extension provides parsers, embedders, splitters, and an in-memory real-time vector index, plus document-store components that re-index automatically on new data. Teams use it to build retrieval-augmented generation pipelines whose indexes stay current without a separate batch re-indexing job. The framework integrates with LlamaIndex and LangChain.
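The incremental recomputation described above can be illustrated with a small standard-library sketch. It is not Pathway's API: the class `IncrementalWordCount` and the `+1`/`-1` change convention are assumptions made for illustration. The state is updated by applying insertions and retractions, and the result matches a from-scratch batch computation over the current rows.

```python
from collections import Counter


class IncrementalWordCount:
    """Hypothetical illustration: keeps word counts by applying changes,
    instead of rescanning every input row."""

    def __init__(self):
        self.counts = Counter()

    def apply(self, text, diff):
        """Apply one change: diff=+1 inserts a row, diff=-1 retracts it."""
        for word in text.lower().split():
            self.counts[word] += diff
            if self.counts[word] == 0:
                del self.counts[word]  # drop words whose rows were all retracted


def batch_word_count(rows):
    """Reference result: recompute from scratch over the current rows."""
    return Counter(word for row in rows for word in row.lower().split())


state = IncrementalWordCount()
state.apply("live data beats stale data", +1)
state.apply("stale index", +1)

# An update is modelled as a retraction of the old row plus an insertion
# of the new one; only the affected words are touched.
state.apply("stale index", -1)
state.apply("fresh index", +1)

current_rows = ["live data beats stale data", "fresh index"]
incremental_result = dict(state.counts)
batch_result = dict(batch_word_count(current_rows))
```

Here `incremental_result` equals `batch_result`, which is the property that lets one pipeline definition serve both batch and streaming inputs in this toy model.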

Agent loop shape. Pathway is a data-pipeline runtime rather than an agent loop. Connectors ingest documents and change events; the differential-dataflow engine parses, chunks, embeds, and indexes them incrementally; and a vector or document store serves retrieval queries. A question-answering layer retrieves the top documents for a query and passes them to an LLM, with the index kept in sync as sources change.
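The flow above (change events in, incremental indexing, retrieval, prompt assembly) can be sketched in plain Python. This is a simplified stand-in, not Pathway code: `embed`, `LiveIndex`, `on_change`, and `build_prompt` are hypothetical names, the embedder is a toy bag-of-words vector, and parsing and chunking are omitted.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedder (assumption): a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class LiveIndex:
    """In-memory index updated per change event rather than rebuilt in batch."""

    def __init__(self):
        self.vectors = {}  # doc_id -> embedding
        self.texts = {}    # doc_id -> source text

    def on_change(self, doc_id, text):
        """Handle one change event; text=None means the source was deleted."""
        if text is None:
            self.vectors.pop(doc_id, None)
            self.texts.pop(doc_id, None)
        else:
            self.texts[doc_id] = text
            self.vectors[doc_id] = embed(text)  # re-embed only the changed document

    def retrieve(self, query, k=2):
        """Return the texts of the k documents most similar to the query."""
        q = embed(query)
        ranked = sorted(self.vectors, key=lambda d: cosine(q, self.vectors[d]), reverse=True)
        return [self.texts[d] for d in ranked[:k]]


def build_prompt(question, index, k=2):
    """Assemble the text that a question-answering layer would send to an LLM."""
    context = "\n".join(index.retrieve(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}"


index = LiveIndex()
index.on_change("a", "refund policy allows returns within 30 days")
index.on_change("b", "shipping takes five business days")
before = index.retrieve("refund policy", k=1)

index.on_change("a", "refund policy allows returns within 60 days")  # source edited
after = index.retrieve("refund policy", k=1)
prompt = build_prompt("refund policy", index, k=1)

index.on_change("a", None)  # source deleted
remaining_ids = sorted(index.texts)
```

After the edit event, the same query returns the 60-day text without any separate re-indexing step, and after the delete event only document `b` remains in the index.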

Primary use cases

real-time retrieval-augmented generation pipelines
live document indexing and vector search
incremental ETL over streaming and batch data
continuously synchronised knowledge bases for agents

flowchart TD
    fw["Pathway"]
    fw --> p1["Streaming Feature Pipeline<br/>(core)"]
    fw --> p2["Agentic RAG<br/>(supported)"]
    fw --> p3["Naive RAG<br/>(supported)"]

Key concepts

DocumentStore → streaming-feature-pipeline (docs) — A pipeline component that automatically indexes documents and updates itself when new data arrives, so the served index reflects the source without a separate batch re-indexing job.
Differential dataflow engine (docs) — A scalable Rust engine that performs incremental computation, recomputing only what changed as inputs update, and runs the same Python pipeline code over both batch and streaming data.
In-memory real-time Vector Index → vector-memory (docs) — A vector index kept inside the running pipeline that stays synchronised with data sources in real time, serving similarity queries for RAG over live documents.
AdaptiveRAGQuestionAnswerer → agentic-rag (docs) — A RAG question-answerer that limits the number of documents sent to the LLM chat to save tokens, adjusting retrieval to the query.
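One plausible way to limit the documents sent to the LLM, as the last entry describes, is to start with a small retrieval size and widen it only when the model cannot answer. The sketch below shows that strategy in plain Python. It is an assumption about the general technique, not the AdaptiveRAGQuestionAnswerer implementation: `adaptive_answer`, `n_start`, `factor`, `max_rounds`, and both stub functions are hypothetical.

```python
def adaptive_answer(question, retrieve, ask_llm, n_start=2, factor=2, max_rounds=4):
    """Ask with few documents first; widen retrieval only if no answer is found.

    retrieve(question, k) returns a list of documents.
    ask_llm(question, docs) returns an answer string, or None if it cannot answer.
    Returns (answer, number_of_documents_used_in_the_last_round).
    """
    k = n_start
    docs = []
    for _ in range(max_rounds):
        docs = retrieve(question, k)
        answer = ask_llm(question, docs)
        if answer is not None:
            return answer, len(docs)
        k *= factor  # geometric growth keeps the number of rounds small
    return None, len(docs)


# Stub corpus and stubs standing in for a real retriever and a real LLM call.
corpus = (
    [f"filler note {i}" for i in range(5)]
    + ["the launch date is 12 May"]
    + [f"filler note {i}" for i in range(5, 9)]
)
requested_sizes = []


def stub_retrieve(question, k):
    requested_sizes.append(k)
    return corpus[:k]


def stub_llm(question, docs):
    # Pretend the model can answer only if the relevant document is in context.
    for doc in docs:
        if "launch date" in doc:
            return doc
    return None


answer, docs_used = adaptive_answer("When is the launch?", stub_retrieve, stub_llm)
```

In this run the relevant document sits sixth in the ranking, so the loop requests 2, 4, and then 8 documents before answering. Queries whose answer ranks near the top would stop after the first, cheapest round.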