IV · Retrieval & RAG · Mature ★★

CDC-Driven Vector Sync

also known as Change-Data-Capture RAG Sync, Event-Driven Vector Index Update

Treat the source-of-truth document store as the only store the application writes to; keep the vector index in sync by emitting change-data-capture events onto a queue that the feature pipeline consumes.

Context

A RAG system reads from a vector index built over a corpus that lives in a source-of-truth store (database, document system, content platform). The corpus changes continuously — inserts, updates, deletes. The vector index must stay in sync or retrieval returns stale or missing material.

Problem

Periodic batch rebuilds of the vector index are expensive, lag the source, and waste compute re-embedding unchanged documents. Dual-writing (the writer updates both the source and the vector index) is brittle: a crash between writes leaves the two stores inconsistent, and the writer code must understand the embedding pipeline. Without an event-driven path from source-of-truth changes to vector-index updates, embeddings drift silently from the corpus and retrieval quality degrades.

Forces

  • The application should write only to the source-of-truth store, and the vector index should have a single writer, the feature pipeline (single-writer principle).
  • Dual-writes from the application leak embedding-pipeline knowledge into the writer.
  • Batch rebuilds waste compute and lag the source.
  • CDC events provide an ordered signal of inserts, updates, and deletes.

Example

A knowledge-base platform stores articles in MongoDB. The vector index over the article corpus must stay current as the editorial team adds, edits, and retires articles. The team enables MongoDB change streams, and each change is published to RabbitMQ. A Bytewax feature pipeline consumes the events, then cleans, chunks, embeds, and upserts the content into Qdrant. Editors see new articles in RAG results within seconds; the editorial system writes only to MongoDB.
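
The publishing side of this example might look roughly like the sketch below. It is not the platform's actual code: the message shape (`op`, `id`, `text`), the `body` field, and the `to_message` and `relay` helpers are all assumptions made for illustration. The event fields it reads (`operationType`, `documentKey`, `fullDocument`) are the ones MongoDB change events carry. The change source and the publisher are passed in, so the logic can be exercised without a running MongoDB or RabbitMQ.

```python
import json
from typing import Any, Callable, Iterable, Optional

# Operation types the vector index cares about; others (e.g. "drop") are skipped.
HANDLED_OPS = {"insert", "update", "replace", "delete"}


def to_message(change: dict[str, Any]) -> Optional[str]:
    """Translate one change-stream event into a JSON queue message.

    Returns None for event types that do not affect the index.
    """
    op = change.get("operationType")
    if op not in HANDLED_OPS:
        return None
    message = {
        # Inserts, updates, and replaces all become "upsert" downstream.
        "op": "delete" if op == "delete" else "upsert",
        "id": str(change["documentKey"]["_id"]),
    }
    if op != "delete":
        # fullDocument is present for inserts and replaces; for updates it is
        # only present when the stream is opened with full_document="updateLookup".
        doc = change.get("fullDocument") or {}
        message["text"] = doc.get("body", "")  # "body" is a hypothetical field name
    return json.dumps(message)


def relay(changes: Iterable[dict[str, Any]], publish: Callable[[str], None]) -> int:
    """Forward every relevant change to the queue; return how many were sent."""
    sent = 0
    for change in changes:
        message = to_message(change)
        if message is not None:
            publish(message)
            sent += 1
    return sent
```

In a deployment, `changes` would plausibly be the cursor returned by pymongo's `collection.watch(full_document="updateLookup")`, and `publish` a thin wrapper around pika's `channel.basic_publish`. A production relay would also persist the change stream's resume token so it can pick up where it left off after a restart.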

Solution

Therefore:

Enable change-data-capture on the source-of-truth store, either natively (MongoDB change streams, PostgreSQL logical replication) or through a CDC connector (Debezium on Kafka Connect). Publish each change as an event to a queue or topic (Kafka, RabbitMQ, SNS). The feature pipeline subscribes: on insert, embed and upsert; on update, re-embed and overwrite; on delete, remove from the vector index. The writer code knows nothing about embeddings. The pipeline can be paused and redeployed without losing changes, and it can be backfilled by replaying events where the broker retains history (a Kafka topic, for example).
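
The consuming side can be sketched as a single event handler. Everything named here is an assumption for illustration: the JSON message shape (`op`, `id`, `text`), the `InMemoryIndex` stand-in for a real vector store, the fixed-size `chunk` function, and the `toy_embed` placeholder, which is not a real embedding model. The sketch treats insert and update identically (delete the document's existing chunks, then write the new ones), so a shorter revision leaves no stale chunks behind and replaying the same event is harmless.

```python
import hashlib
import json
from typing import Callable

Vector = list[float]


def chunk(text: str, size: int = 200) -> list[str]:
    """Split text into fixed-size character windows (a deliberately naive chunker)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def toy_embed(text: str) -> Vector:
    """Placeholder embedding: hash bytes scaled to [0, 1]. Swap in a real model."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:8]]


class InMemoryIndex:
    """Hypothetical stand-in for a vector store, keyed by (doc_id, chunk_no)."""

    def __init__(self) -> None:
        self.points: dict[tuple[str, int], tuple[Vector, str]] = {}

    def delete_document(self, doc_id: str) -> None:
        # Collect keys first so the dict is not mutated while iterating.
        for key in [k for k in self.points if k[0] == doc_id]:
            del self.points[key]

    def upsert(self, doc_id: str, chunk_no: int, vector: Vector, text: str) -> None:
        self.points[(doc_id, chunk_no)] = (vector, text)


def apply_event(
    index: InMemoryIndex,
    raw: str,
    embed: Callable[[str], Vector] = toy_embed,
) -> None:
    """Apply one CDC message to the index; safe to call twice with the same message."""
    event = json.loads(raw)
    doc_id = event["id"]
    # Clear the document's existing chunks before writing anything new.
    index.delete_document(doc_id)
    if event["op"] == "delete":
        return
    for chunk_no, piece in enumerate(chunk(event["text"])):
        index.upsert(doc_id, chunk_no, embed(piece), piece)
```

Because chunk keys are derived from the document id and chunk position, the handler is idempotent, which is what makes at-least-once delivery and backfill by replay tolerable. With a real vector store, `delete_document` would typically become a delete filtered on a document-id payload field.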

What this pattern forbids. Vector indices over a changing corpus must not be kept in sync by dual-writes from application code; CDC events from the source-of-truth store drive embedding updates.

The smaller patterns that complete this one —

  • uses: Vector Memory ★★. Store memories as embeddings in a vector index and retrieve the most semantically similar items at query time.

And the patterns that stand alongside it, or against it —

  • composes-with: Streaming Feature Pipeline. Process raw documents into RAG features as a continuous stream rather than a batch job, with typed models pinning each stage.
  • composes-with: FTI LLM Pipeline Split ★★. Decompose an LLM/RAG system into three independently deployable pipelines (feature, training, inference) that communicate only via a feature store and a model registry.
  • complements: Event-Driven Agent ★★. Trigger the agent on external events (webhooks, message queues, file changes) instead of user requests or schedules.
  • complements: Agentic RAG ★★. Replace static retrieve-then-generate with autonomous agents that plan, choose sources, retrieve iteratively, reflect, and re-query.
