Retrieval & RAG

Streaming Feature Pipeline

Process raw documents into RAG features as a continuous stream rather than a batch job, with typed models pinning each stage.

Problem

Batch ingestion lags the source by the rebuild cadence and wastes compute re-processing unchanged documents. Ad-hoc streaming code without a stage-pinning discipline (raw → cleaned → chunked → embedded) accumulates implicit data shape transitions that break silently as the pipeline evolves. Without a typed stream pipeline, real-time RAG ingestion becomes a debug nightmare on every schema or chunking change.

Solution

Use a streaming framework (Bytewax, Flink, Kafka Streams) to consume change events. Define a Pydantic (or equivalent) model per stage: RawDocument → CleanedDocument → ChunkedDocument → EmbeddedDocument. Each stage is a map operation that takes one model and emits the next; type errors surface at the stage boundary. Failed events go to a dead-letter queue for inspection rather than blocking the stream. Upserts to the vector index happen as the embedded model flows out of the last stage.

When to use

  • Real-time RAG ingest is needed and batch lag is unacceptable.
  • Source events can be modelled as a stream (CDC, webhook, queue).
  • Engineering capacity to operate a streaming framework exists.

Open the full interactive page

Diagram, neighbourhood map, code examples, related patterns and full provenance.

Related