Prompt Caching

also known as Cache-Aware Prompts, Stable-Prefix Caching

Order prompts so the unchanging prefix can be cached by the provider, cutting per-call cost and latency.

This pattern helps complete certain larger patterns —

used-byContextual Retrieval★— Prepend a short LLM-generated description to each chunk before embedding so the chunk carries its situating context.

Context

A team is running an agent that calls the same large language model many times per session. Most of each prompt is a stable prefix that does not change between calls (system prompt, tool definitions, charter, code-style rules) and only a small suffix varies (the current user message, the latest tool result). The provider's API exposes a prompt cache keyed on byte-identical prefixes.

Problem

Re-sending an identical 10,000-token prefix on every call burns input tokens that the provider would otherwise serve from a warm cache, and it adds time-to-first-token latency for content the model has already seen. Cache hits are silent — a single accidental mutation in the prefix (a timestamp in the system prompt, a tool list reordered by JSON object iteration, a per-call correlation ID) invalidates the cache without any error, so the team can spend months overpaying without realising the cache never warmed.

Forces

Cache TTL caps savings (idle agents lose the warm cache) vs always-fresh prefix.
Stability for cache-hit vs flexibility to mutate the prompt.
Engineering rigor on prompt order vs developer ergonomics.

Example

A coding agent ships a 12k-token system prompt that includes tool schemas, charter, and code-style rules, and per-call costs feel high. Inspecting the cache-hit metric shows zero hits because the per-call user message is being prepended to the system prompt by accident, breaking the byte-stable prefix. The team applies prompt-caching discipline: stable content (system prompt, tool definitions, charter) moves to the start; variable content (current state, user message) moves to the end; the cache breakpoint is marked at the boundary. Cache hit rate jumps to over 90 percent and per-call cost halves.

Diagram

flowchart TD SP[Stable: system + tools + charter] -->|cache breakpoint| CB[(Cached prefix)] CB --> V[Variable: state + user message] V --> LLM[LLM call] LLM --> R[Response<br/>cheaper, faster]

Solution

Therefore:

Place all stable content (system prompt, tool definitions, charter, rules) at the start of the prompt. Place variable content (current state, user message) at the end. Mark the cache breakpoint at the boundary. Audit prompt construction to ensure no accidental prefix mutation.

What this pattern forbids. The cached prefix is forbidden from changing call to call; mutation invalidates the cache.

And the patterns that stand alongside it, or against it —

complementsCost Gating★★— Block actions whose expected cost exceeds a threshold without explicit user (or operator) acknowledgement.
complementsReasoning Trace Carry-Forward★— For reasoning models that emit a separate reasoning trace, preserve that trace in context across the same logical task episode (across tool-call/result turns) but drop it at user-turn boundaries.
complementsNow-Anchoring·— Ground the agent's reasoning in the current absolute time without requiring tool calls, so every reply is implicitly time-aware.
complementsSleep-Time Compute·— During idle or downtime, run the model offline against the user's standing context to pre-compute dense summaries and likely future answers, so test-time latency and cost drop when the user actually asks.
complementsTool Loadout Hot-Swap✕— Anti-pattern: add or remove tool definitions during a running task so the tool set the model sees changes from turn to turn.
complementsRealtime API When Batchable✕— Anti-pattern: use the realtime/synchronous model API for workloads whose latency budget would permit batching, paying 2–10× the unit cost for no user-visible benefit.
complementsBusiness + LLM Microservice Split★★— Split an LLM application into a CPU-bound business microservice (retrieval, prompt assembly, orchestration) and a GPU-bound LLM microservice (only model.generate behind REST), so each tier scales on its own hardware budget.
alternative-toSemantic Response Cache★— Embed each query and, when its nearest cached neighbour is within a similarity threshold, return the stored answer instead of re-running the model so near-duplicate questions are answered cheaply.

Neighbourhood

Click any neighbour to follow the language. Scroll to zoom, drag to pan.

Used in recipes

Production LLM Platform
optional

Used in frameworks

References

Anthropic: Prompt caching
doc

Provenance

Source: patterns/prompt-caching.md on GitHub · commit 4fa1213 · view history
Added to catalog: 2026-04-30
Last updated: 2026-05-22
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.