Semantic Response Cache
also known as Vector Response Cache, Similarity Cache, LLM Semantic Cache
Embed each query and, when its nearest cached neighbour is within a similarity threshold, return the stored answer instead of re-running the model so near-duplicate questions are answered cheaply.
Context
A service answers a stream of natural-language queries with an LLM or a retrieval-augmented pipeline, and many of those queries are near-duplicates phrased differently: the same support question, the same lookup, the same intent worded a dozen ways. Each call costs tokens and latency, yet a classic key-value cache only fires on byte-for-byte repeats, so it almost never hits when wording varies.
Problem
An exact-match cache keyed on the raw text misses two queries that mean the same thing but read differently, so the expensive pipeline runs again for an answer that already exists. Paraphrase, word order, punctuation, and filler all defeat string equality. The service needs a cache that recognises that two differently worded queries are close enough in meaning to share one answer, without re-running the model to find out.
Forces
- A looser similarity threshold lifts the hit rate and cuts cost, but raises the risk of returning a stored answer to a query that only looks similar.
- Embedding and vector-searching every incoming query adds work, so the cache only pays off when the hit rate is high enough to cover that overhead.
- Cached answers age: an entry that was correct when stored can go stale as the underlying data or documents change.
- A semantic match hides why an answer was returned: a wrong hit looks like any other response, so it is harder to spot than a miss in an exact-key cache.
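The overhead force can be made concrete with a rough break-even estimate. The symbols are assumptions introduced here, not part of the pattern: $h$ is the hit rate, $C_m$ the cost of one full model or pipeline call, and $C_e$ the cost of embedding a query and searching the store. The sketch ignores storage cost and the cost of a wrong hit.

```latex
% Expected cost per query with the cache in front of the model:
% every query pays C_e, and only misses (fraction 1 - h) pay C_m.
\[
  \mathbb{E}[\text{cost}] = C_e + (1 - h)\, C_m
\]
% The cache pays off when this is below the uncached cost C_m:
\[
  C_e + (1 - h)\, C_m < C_m
  \quad\Longleftrightarrow\quad
  h > \frac{C_e}{C_m}
\]
```

Under these assumptions, if embedding plus search costs 2% of a model call, the cache breaks even at a 2% hit rate, and every hit beyond that is a saving.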
Example
A customer-support assistant gets the same billing question worded a hundred ways: 'how do I change my card', 'where do I update payment method', 'can I switch the card on file'. Each one embeds close to the others, so after the first is answered by the LLM the rest land on a cached hit and return in milliseconds without another model call, while a genuinely new question falls through to the full pipeline.
Solution
Therefore:
Keep a vector store of prior (query, answer) pairs. On each request, embed the query and search the store for its nearest neighbour. If the neighbour's similarity clears a configured threshold, return the stored answer immediately and skip the model entirely. If nothing clears the threshold, run the full LLM or retrieval pipeline, return its answer, and insert the new (query embedding, answer) pair so the next near-duplicate hits. The threshold is the central knob: set it tight enough that only genuinely equivalent queries share an answer, and pair it with a time-to-live or an invalidation hook so entries do not outlive the data they summarise. Scope the store per tenant or per user when answers depend on identity, so one caller's cached answer is never served to another.
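The flow above can be sketched as follows. This is a minimal illustration, not a reference implementation: `SemanticCache`, `answer`, and `toy_embed` are hypothetical names, the linear scan stands in for a real vector index, and `toy_embed` is a crude word-count stand-in for a real embedding model that only matches queries sharing vocabulary words.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Entry:
    embedding: Vector
    answer: str
    expires_at: float  # clock value after which the entry is stale


class SemanticCache:
    """Hypothetical in-memory cache, scoped per tenant, with a TTL."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9,
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.embed = embed
        self.threshold = threshold
        self.ttl = ttl_seconds
        self.clock = clock
        self.stores: Dict[str, List[Entry]] = {}  # tenant -> entries

    def lookup(self, tenant: str, query: str) -> Tuple[Optional[str], Vector]:
        """Return (cached answer or None, query embedding)."""
        vec = self.embed(query)
        now = self.clock()
        # Drop expired entries so stale answers are never served.
        live = [e for e in self.stores.get(tenant, []) if e.expires_at > now]
        self.stores[tenant] = live
        # Linear nearest-neighbour scan; a vector index would replace this.
        best = max(live, key=lambda e: cosine(vec, e.embedding), default=None)
        if best is not None and cosine(vec, best.embedding) >= self.threshold:
            return best.answer, vec
        return None, vec

    def insert(self, tenant: str, vec: Vector, answer: str) -> None:
        entry = Entry(vec, answer, self.clock() + self.ttl)
        self.stores.setdefault(tenant, []).append(entry)

    def invalidate(self, tenant: str) -> None:
        """Invalidation hook: call when the tenant's underlying data changes."""
        self.stores.pop(tenant, None)


def answer(cache: SemanticCache, tenant: str, query: str,
           pipeline: Callable[[str], str]) -> str:
    """Serve from the cache on a hit; otherwise run the pipeline and store."""
    cached, vec = cache.lookup(tenant, query)
    if cached is not None:
        return cached
    fresh = pipeline(query)
    cache.insert(tenant, vec, fresh)
    return fresh


# Toy stand-in for an embedding model: counts of a few vocabulary words.
VOCAB = ["change", "update", "card", "payment", "refund", "invoice"]


def toy_embed(text: str) -> List[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]
```

Returning the embedding from `lookup` lets the miss path reuse it for the insert, so each query is embedded once. Keying the store by tenant keeps one caller's entries out of another caller's search entirely, rather than filtering after the match.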
What this pattern forbids. A query is answered from the cache only when its nearest stored neighbour clears the similarity threshold; below that threshold the cache must not return a stored answer, and the full pipeline runs instead.
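One way the threshold might be chosen, sketched under assumptions: given a small hand-labelled sample of query pairs, each with a similarity score and a flag saying whether the two queries really deserve the same answer, pick the loosest threshold that keeps false hits within a tolerated rate. The function `pick_threshold` and its `max_false_hit_rate` parameter are hypothetical, and the sample scores are illustrative, not measured.

```python
from typing import List, Optional, Tuple


def pick_threshold(pairs: List[Tuple[float, bool]],
                   max_false_hit_rate: float = 0.0) -> Optional[float]:
    """Return the lowest observed score usable as a threshold, or None.

    pairs holds (similarity, is_equivalent) for labelled query pairs.
    A false hit is a non-equivalent pair whose similarity is at or above
    the threshold, i.e. a pair the cache would wrongly treat as a match.
    """
    negatives = [score for score, equivalent in pairs if not equivalent]
    # Try candidate thresholds from loosest to tightest.
    for candidate in sorted({score for score, _ in pairs}):
        false_hits = sum(1 for score in negatives if score >= candidate)
        if not negatives or false_hits / len(negatives) <= max_false_hit_rate:
            return candidate
    return None  # even the tightest observed score admits too many false hits


# Illustrative labelled sample (scores are made up for the sketch).
sample = [(0.95, True), (0.93, True), (0.91, True), (0.88, False), (0.80, False)]
strict = pick_threshold(sample)          # tolerate no false hits
lenient = pick_threshold(sample, 0.5)    # tolerate half of the negatives
```

A result of `None` signals that similarity alone cannot separate the labelled pairs, which is a reason to tighten the scope of the cache rather than loosen the threshold.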
And the patterns that stand alongside it, or against it:
- alternative-to Prompt Caching ★★: Order prompts so the unchanging prefix can be cached by the provider, cutting per-call cost and latency.
- alternative-to Tool Result Caching ★★: Cache the result of expensive deterministic tool calls keyed by their arguments so repeat calls within a session return immediately.
- complements Naive RAG ★★: Condition the generator on top-k chunks retrieved from an external dense index so knowledge lives outside parameters.
- complements Agentic RAG ★★: Replace static retrieve-then-generate with autonomous agents that plan, choose sources, retrieve iteratively, reflect, and re-query.