Context Window Packing
Decide what goes into the context window on each turn, given a fixed token budget.
Problem
Naively concatenating everything overflows the window, and the call fails. Naively truncating from the start or the end drops information that may be critical: the original task, the most recent tool result, or the system prompt itself. A first-fit packing strategy leaves the model with a different subset on every call, which makes behaviour unpredictable. The team needs a deliberate policy for what is preserved, what is summarised, what is retrieved on demand, and what is dropped, and that policy has to be applied consistently across calls.
Solution
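Define a packing policy. Reserve N tokens for the fixed parts of the call: the system prompt, the tool definitions, and the model's response. Allocate the remaining budget across history (compressed), retrieved chunks (top-k after reranking), and current state. Within each allocation, apply one of three policies: eviction (drop the oldest items), summarisation (compress older content), or selection (rank by relevance and keep the best). Audit token counts before each call.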
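The policy above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: `count_tokens` is a hypothetical whitespace-based stand-in for a real tokenizer, the 50/30/20 split is an arbitrary example, and every function name is invented for this sketch. Summarisation is left out because it needs a model call; only eviction and selection are shown.

```python
def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def allocate(window: int, reserved: int, shares: dict[str, int]) -> dict[str, int]:
    """Split the tokens left after the fixed reservation; shares are integer percentages."""
    available = window - reserved
    if available <= 0:
        raise ValueError("reservation leaves no room for dynamic context")
    return {name: available * pct // 100 for name, pct in shares.items()}


def evict_oldest(messages: list[str], limit: int) -> list[str]:
    """Eviction: keep the newest messages that fit, dropping from the oldest end."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > limit:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))


def select_top(chunks: list[tuple[float, str]], limit: int) -> list[str]:
    """Selection: walk chunks in descending score order, skipping any that do not fit."""
    kept, used = [], 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= limit:
            kept.append(text)
            used += cost
    return kept


def pack(window: int, reserved: int, history: list[str],
         chunks: list[tuple[float, str]], state: str) -> tuple[dict[str, list[str]], int]:
    """Apply the policy and return the packed sections plus the audited token total."""
    # Assumed split of the dynamic budget: 50% history, 30% retrieval, 20% state.
    limits = allocate(window, reserved, {"history": 50, "retrieved": 30, "state": 20})
    if count_tokens(state) > limits["state"]:
        raise ValueError("current state exceeds its allocation")
    packed = {
        "history": evict_oldest(history, limits["history"]),
        "retrieved": select_top(chunks, limits["retrieved"]),
        "state": [state],
    }
    # Audit before the call: the reservation plus everything packed must fit the window.
    used = reserved + sum(count_tokens(t) for part in packed.values() for t in part)
    if used > window:
        raise ValueError(f"packed context uses {used} tokens, window is {window}")
    return packed, used
```

Because the limits are computed the same way on every call, the same inputs always produce the same packed context, which is the consistency the Problem section asks for. A real system would swap in the model's own tokenizer and tune the split per workload.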
When to use
- Naive concatenation overflows the context window for realistic inputs.
- Some context (system, tools, response reservation) is fixed and the rest must be allocated dynamically.
- You can audit token counts before each call and adjust the policy.
Related
- Dynamic Scaffolding
- Episodic Summaries
- MemGPT-Style Paging
- Reasoning Trace Carry-Forward
- Salience Attention Mechanism
- Self-Archaeology
- Todo-List-Driven Autonomous Agent
- Tool Search Lazy Loading
- Sleep-Time Compute
- Context Window Dumb-Zone Cap
- Landmark Attention
- Information Chunking for Agent Memory
- Lost in the Middle (Positional Bias)
- Context Compaction
- Tool-Result Eviction