V · Memory · Experimental

Sleep-Time Compute

also known as Offline Pre-Computation, Anticipatory Context Distillation, Background Thinking, Latency-Free Pre-Answering

During idle periods, run the model offline against the user's standing context to pre-compute dense summaries and likely future answers, so that test-time latency and cost drop when the user actually asks.

Context

A team is running an agent over persistent user context — a codebase, a set of documents, transcripts of prior sessions — that the user queries repeatedly. Many of the queries are predictable variants of previous ones, and the underlying corpus does not change between most of those queries. The provider infrastructure also has idle capacity between user sessions when nobody is actively waiting for an answer.

Problem

Conventional inference does all the work at test time, when the user is waiting. For every query the system parses the corpus, finds what matters, reasons about it, and produces an answer; the next query repeats this work from scratch even if it asks something very similar. Prompt caching helps only when the prefix matches exactly. The user therefore pays latency on every question, even though many questions about a stable corpus could have been pre-processed during idle periods into indices, summaries, or partial answers that would make the eventual user-visible step nearly instantaneous.

Forces

Test-time latency is what the user feels; offline latency is invisible.
Most queries against a stable corpus are predictable variants — predict and pre-answer once.
Prefetching wastes compute on queries that never come, so prediction must be cheap and recoverable.
Prompt caching only helps for matching prefixes; speculative pre-answering generates new content.
Pre-computed answers go stale as the corpus changes, which forces a trade-off between freshness and cost.
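The tension between wasted prefetching and saved test-time work can be made concrete with a rough break-even calculation. The sketch below is a simplification, not a model taken from the pattern's sources: the function names and the four cost parameters are hypothetical, all costs are assumed to be in the same unit (for example tokens or dollars), and the value of reduced latency to the user is ignored.

```python
def prefetch_net_saving(hit_rate: float, live_cost: float,
                        adapt_cost: float, precompute_cost: float) -> float:
    """Expected net compute saved by pre-answering one predicted query.

    hit_rate        -- assumed probability that the predicted query is asked
    live_cost       -- cost of answering it from scratch at test time
    adapt_cost      -- cost of lightly adapting a cached pre-answer on a hit
    precompute_cost -- cost of producing the pre-answer offline
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    saved_on_hit = live_cost - adapt_cost
    return hit_rate * saved_on_hit - precompute_cost


def break_even_hit_rate(live_cost: float, adapt_cost: float,
                        precompute_cost: float) -> float:
    """Smallest hit rate at which pre-answering stops losing compute."""
    saved_on_hit = live_cost - adapt_cost
    if saved_on_hit <= 0:
        return float("inf")  # a hit saves nothing, so prefetching never pays
    return precompute_cost / saved_on_hit
```

Under these assumptions, a prediction is worth pre-answering only when its estimated hit rate exceeds `break_even_hit_rate(...)`. If idle capacity is cheaper than test-time capacity, `precompute_cost` can be discounted accordingly, which lowers the threshold.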

Example

A developer agent has indexed a 200K-file monorepo as the user's standing context. Overnight it runs a distillation pass that summarizes each top-level module and predicts likely next-day queries from the user's commit history and yesterday's questions. When the developer asks the next morning 'what changed in the billing module last week and which tests cover it', the agent retrieves a pre-answer generated at 03:00 that morning and adapts it with one extra inference call instead of re-walking the repo from scratch.
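A minimal sketch of what such an overnight pass might look like. Everything here is illustrative: `predict_queries`, `offline_pass`, and the injected `summarize` and `answer` callables are hypothetical names, and predicting by simple frequency over past questions is a stand-in for whatever signal (commit history, structural cues) a real system would use.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple


def predict_queries(history: Iterable[str], top_k: int = 3) -> List[str]:
    """Hypothetical predictor: the most frequent normalized past questions."""
    counts = Counter(q.strip().lower() for q in history if q.strip())
    return [q for q, _ in counts.most_common(top_k)]


def offline_pass(
    modules: Dict[str, str],
    history: Iterable[str],
    summarize: Callable[[str], str],
    answer: Callable[[str, Dict[str, str]], str],
    top_k: int = 3,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run distillation, then speculative pre-answering, during idle time.

    `summarize` and `answer` stand in for model calls and are supplied
    by the caller; nothing here talks to a real model.
    """
    # Distillation: one summary per top-level module.
    summaries = {name: summarize(text) for name, text in modules.items()}
    # Speculative pre-answering: answer predicted queries from the summaries.
    pre_answers = {
        query: answer(query, summaries)
        for query in predict_queries(history, top_k)
    }
    return summaries, pre_answers
```

Passing the model calls in as plain callables keeps the scheduling logic testable without a model and leaves the choice of summarizer and answerer open.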

Diagram

flowchart TD
  subgraph OFFLINE[Offline / idle]
    SCH[Idle scheduler] --> DIST[Distillation pass]
    DIST --> SUM[Per-file / per-topic summaries]
    SCH --> SPEC[Speculative-query generator]
    SPEC --> PA[Pre-answer pass]
    PA --> CACHE[(Pre-answer cache<br/>embedding-indexed)]
  end
  subgraph LIVE[Test time]
    Q[User query] --> LK[Embedding lookup]
    LK -->|hit| HIT[Return / lightly adapt pre-answer]
    LK -->|miss| INF[Live inference]
    INF --> APP[Append query to prediction set]
  end
  CACHE -.-> LK
  APP -.-> SPEC

Solution

Therefore:

Run two kinds of offline passes against the user's standing context. (1) Distillation: compress the corpus into structured summaries — per-file, per-module, per-topic — that capture what queries would likely need. (2) Speculative pre-answering: predict likely next queries (from query history, recent context, structural signals) and generate answers ahead of time, stored against query embeddings. At test time, the agent first checks the speculative cache; on a hit it returns or lightly adapts the pre-answer; on a miss it falls back to live inference but adds the new query to the prediction set. Pre-computed material is invalidated when its source documents change. The Letta team and Lin et al. report substantial test-time cost and latency reductions on this pattern.
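The test-time half of this flow (check the speculative cache, return on a hit, fall back on a miss and record the query) can be sketched as follows. This is an illustration under stated assumptions, not the implementation reported by the cited authors: `PreAnswerCache`, its methods, and the 0.8 similarity threshold are hypothetical, and `embed` is a toy bag-of-words stand-in for a real embedding model. The light adaptation step on a hit is omitted.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class PreAnswerCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold            # assumed similarity cut-off
        self.entries: List[Tuple[Counter, str]] = []
        self.prediction_set: List[str] = []   # misses fed back to the predictor

    def store(self, query: str, answer: str) -> None:
        """Called by the offline pass for each speculative pre-answer."""
        self.entries.append((embed(query), answer))

    def lookup(self, query: str) -> Optional[str]:
        """Return the closest pre-answer if it clears the threshold."""
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def answer(self, query: str,
               live_inference: Callable[[str], str]) -> Tuple[str, str]:
        """Return (answer, origin), where origin is 'pre-answer' or 'live'."""
        hit = self.lookup(query)
        if hit is not None:
            return hit, "pre-answer"
        self.prediction_set.append(query)     # miss: remember for the next pass
        return live_inference(query), "live"
```

A linear scan over entries is enough to show the control flow; a real deployment would presumably use a vector index and tune the threshold against observed false hits.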

What this pattern forbids. The agent must not return a stale pre-computed answer when its source documents have changed since pre-computation; freshness checks must gate cache hits. Speculative pre-answers must be marked as such in the trace so downstream evaluation can distinguish them from live inference.
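Both constraints can be enforced at the point where a cache hit is served. The sketch below assumes one possible mechanism: each pre-answer records a content hash of every source document it was derived from, and a hit is honored only if all of those hashes still match. The names `PreAnswer`, `is_fresh`, `serve`, and the `origin` trace field are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Optional


def fingerprint(text: str) -> str:
    """Content hash used to detect that a source document has changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class PreAnswer:
    answer: str
    source_hashes: Dict[str, str]  # doc id -> fingerprint at pre-computation


def is_fresh(entry: PreAnswer, corpus: Dict[str, str]) -> bool:
    """True only if every source document still exists and is unchanged."""
    return all(
        doc_id in corpus and fingerprint(corpus[doc_id]) == digest
        for doc_id, digest in entry.source_hashes.items()
    )


def serve(entry: Optional[PreAnswer], corpus: Dict[str, str],
          live_inference: Callable[[str], str], query: str) -> Dict[str, str]:
    """Gate the cache hit on freshness and label the origin in the trace."""
    if entry is not None and is_fresh(entry, corpus):
        return {"answer": entry.answer, "origin": "speculative_pre_answer"}
    # Stale or missing entry: never return it, recompute live instead.
    return {"answer": live_inference(query), "origin": "live_inference"}
```

Hashing at serve time trades a little test-time work for certainty; a system with change notifications could instead evict entries eagerly when a source document is modified.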

The smaller patterns that complete this one —

uses Cross-Session Memory ★★: Persist user-specific facts, preferences, and prior context across all sessions, threads, and devices.

And the patterns that stand alongside it, or against it —

complements Episodic Summaries ★★: Compress past episodes into summaries that preserve gist while shedding token cost.
complements Context Window Packing ★★: Choose what fits in the context window each turn given a fixed token budget.
alternative-to Dream Consolidation Cycle ★: Run a deeper, slower reflection pass distinct from per-tick reflection (reading hours of recent thoughts, promoting themes, releasing affective residue, and clearing working memory) so the agent does not accumulate residue indefinitely.
alternative-to Test-Time Compute Scaling ★★: Allocate more inference-time compute (samples, search, deeper thinking) instead of scaling parameters to improve quality.
complements Prompt Caching ★★: Order prompts so the unchanging prefix can be cached by the provider, cutting per-call cost and latency.
complements Adaptive Compute Allocation ★: Allocate inference-time compute (thinking tokens, samples, depth, model size) per query based on input difficulty, rather than using a fixed budget across all queries.
complements Context Compaction ★: When the context window nears its limit, replace the older conversation span with a model-written digest that preserves decisions, commitments, and active constraints while discarding noise, so the agent keeps running without losing the thread.
complements Adaptive Memory Decay ★: Give each long-term memory item a retention score that decays over time through a function modulated by relevance, access frequency, and recency, so unreinforced items fade or fuse while items that are used persist.
complements Speculative Agentic Actions ·: Predict the tool calls the agent is most likely to issue next and execute them preemptively on the current turn, then keep the results that the confirmed trajectory needs and discard the rest.