Memory

Context Window Packing

Choose what fits in the context window each turn given a fixed token budget.

Problem

Naively concatenating everything overflows the window and the call fails. Naively truncating from the start or the end drops information that may be critical (the original task, the most recent tool result, the system prompt itself). A first-fit packing strategy leaves the model with a different subset on every call, which makes behaviour unpredictable. The team needs a deliberate policy for what is preserved, what is summarised, what is retrieved on demand, and what is dropped — and that policy has to be applied consistently across calls.

Solution

Define a packing policy. Reserve N tokens for system + tools + response. Allocate the rest across history (compressed), retrieved chunks (top-k after rerank), and current state. Use eviction (drop oldest), summarisation (compress), or selection (relevance-rank) policies. Audit token counts before each call.

When to use

Naive concatenation overflows the context window for realistic inputs.
Some context (system, tools, response reservation) is fixed and the rest must be allocated dynamically.
You can audit token counts before each call and adjust the policy.

Open the full interactive page →

Diagram, neighbourhood map, code examples, related patterns and full provenance.

Problem

Solution

When to use

Related