Retrieval & RAG

Semantic Response Cache

Embed each query and, when its nearest cached neighbour is within a similarity threshold, return the stored answer instead of re-running the model so near-duplicate questions are answered cheaply.

Problem

An exact-match cache keyed on the raw text misses two queries that mean the same thing but read differently, so the expensive pipeline runs again for an answer that already exists. Paraphrase, word order, punctuation, and filler all defeat string equality. The service needs a cache that recognises that two differently worded queries are close enough in meaning to share one answer, without re-running the model to find out.

Solution

Keep a vector store of prior (query, answer) pairs. On each request, embed the query and search the store for its nearest neighbour. If the neighbour's similarity clears a configured threshold, return the stored answer immediately and skip the model entirely. If nothing clears the threshold, run the full LLM or retrieval pipeline, return its answer, and insert the new (query embedding, answer) pair so the next near-duplicate hits. The threshold is the central knob: set it tight enough that only genuinely equivalent queries share an answer, and pair it with a time-to-live or an invalidation hook so entries do not outlive the data they summarise. Scope the store per tenant or per user when answers depend on identity, so one caller's cached answer is never served to another.

When to use

  • A high share of incoming queries are near-duplicates phrased differently, so an exact-match cache rarely hits.
  • Answering is expensive (token cost or latency) and a stored answer is acceptable for equivalent queries.
  • A similarity threshold and a freshness policy can be tuned and monitored against wrong-hit rate.

Open the full interactive page

Diagram, neighbourhood map, code examples, related patterns and full provenance.

Related