Memory

Landmark Attention

A long-context attention mechanism that places sparse landmark tokens across very long inputs, so the model can jump directly to relevant sections via landmark lookup instead of scanning the input linearly.

Problem

Standard attention's cost grows quadratically with input length, which limits the practical context size. Positional bias also means that content in the middle of the context is retrieved less reliably than content at either end. Naive truncation loses information, and sliding-window attention loses long-range structure.
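The quadratic cost can be made concrete with a rough count of query-key scores. The sketch below is back-of-the-envelope arithmetic, not a benchmark: `block_len` and `top_k` are hypothetical parameters, and the landmark count assumes each query scores one landmark per block and then the tokens of only `top_k` retrieved blocks, ignoring causal masking and any local window.

```python
def full_attention_scores(n_tokens: int) -> int:
    """Dense attention: every token scores against every token."""
    return n_tokens * n_tokens


def landmark_attention_scores(n_tokens: int, block_len: int, top_k: int) -> int:
    """Rough count when each query scores all landmarks, then top_k blocks.

    Assumption: one landmark per block of block_len tokens (hypothetical setup).
    """
    n_blocks = -(-n_tokens // block_len)  # ceiling division
    per_query = n_blocks + min(top_k, n_blocks) * block_len
    return n_tokens * per_query


dense = full_attention_scores(32768)
sparse = landmark_attention_scores(32768, block_len=64, top_k=4)
print(dense, sparse)
```

Under these assumed settings the per-query work depends on the number of blocks plus a fixed retrieval budget, rather than on the full input length.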

Solution

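Mohtashami & Jaggi (2023): augment the input with landmark tokens at chunk boundaries (in the paper, one landmark after each fixed-length block of tokens). The model's attention learns to use the landmarks as a sparse index: how strongly a query attends to a landmark determines whether that landmark's block is retrieved, which enables random-access lookup across very long contexts. The paper reports extending LLaMA 7B's context length to over 32k tokens through fine-tuning. Pair with information-chunking-memory, lost-in-the-middle (which this pattern addresses), and context-window-packing.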
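The landmark-as-index idea can be sketched in NumPy for a single query. This is a simplified illustration, not the paper's method: it uses hard top-k block selection followed by an ordinary softmax, whereas the paper trains landmarks with a grouped softmax. The function name `landmark_retrieve`, the parameters `block_len` and `top_k`, and the use of a block's mean key as its landmark key are all assumptions made for the example.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def landmark_retrieve(query, keys, values, block_len, top_k):
    """Attend only to the top_k blocks whose landmark best matches the query.

    Assumption: each landmark key is the mean of its block's token keys.
    In a trained model it would be the key of a learned landmark token.
    """
    n, d = keys.shape
    starts = list(range(0, n, block_len))
    landmark_keys = np.stack([keys[s:s + block_len].mean(axis=0) for s in starts])

    # Step 1: score every landmark (the sparse index) against the query.
    landmark_scores = landmark_keys @ query / np.sqrt(d)
    k = min(top_k, len(starts))
    chosen = np.sort(np.argsort(landmark_scores)[-k:])

    # Step 2: ordinary attention restricted to tokens in the chosen blocks.
    idx = np.concatenate(
        [np.arange(starts[b], min(starts[b] + block_len, n)) for b in chosen]
    )
    weights = softmax(keys[idx] @ query / np.sqrt(d))
    return weights @ values[idx], chosen


# Toy input: 12 tokens in 3 blocks; only block 2 has keys aligned with the query.
query = np.array([1.0, 0.0, 0.0, 0.0])
keys = np.zeros((12, 4))
keys[8:12, 0] = 5.0
values = np.arange(24, dtype=float).reshape(12, 2)

out, chosen = landmark_retrieve(query, keys, values, block_len=4, top_k=1)
print(chosen, out)
```

In the toy input the relevant block sits away from the start of the sequence, yet it is reached through its landmark score alone, without scoring the individual tokens of the other blocks.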

When to use

  • Very long inputs that exceed standard attention's effective range.
  • Model-side support for landmark attention is available.
  • Retrieval accuracy from the middle of the context matters.

Related