X · Governance & Observability · Mature ★★

Agent Resumption

also known as Durable Execution, Pause-and-Resume, Long-Running Agent State

Persist agent execution state so a long-running run survives restarts, deploys, or user disconnects.

This pattern helps complete certain larger patterns —

  • used-by: Interruptible Agent Execution: Treat pause, resume, and cancel as a first-class control surface on every long-running agent so users can halt expensive or off-track trajectories mid-task while state is preserved for resumption.

Context

A team runs an agent in production that takes minutes or hours to finish a single task, for example scraping and summarising a long list of pages, or driving a multi-step migration. During that time the worker process may be restarted by a deploy, killed by a host failure, or disconnected from the user's session. Operators and end users both expect work in flight to survive these everyday events rather than being thrown away.

Problem

If the agent keeps all of its state in memory and the process dies, the run is gone and the user has to start over, sometimes after waiting forty minutes for nothing. Naively retrying from scratch repeats every side effect that already ran, so emails get sent twice, charges get doubled, and external systems see the same write multiple times. The team is forced to choose between fragile long-running agents and giving up on long-running agents altogether.

Forces

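  • Checkpoint frequency vs cost: checkpointing more often loses less work on a crash but adds a durable write to more steps.
  • What to persist vs what to recompute: state that is cheap to rebuild can be recomputed on resume, while results of slow or side-effecting steps have to be stored.
  • Resumability requires either deterministic-enough replay or full state capture.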

Example

A research agent is forty minutes into a slow scrape-and-summarise run when the operator deploys a hotfix and the worker container restarts. Without persisted state, the run vanishes and the user re-issues the request. The team adds Agent Resumption: every step's plan, tool result, and intermediate state is checkpointed to durable storage, keyed by run id. After the restart, the worker reloads the checkpoint and continues from the next step instead of from scratch.
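The recovery in this example can be sketched as follows. This is a minimal illustration, not the team's actual implementation: the one-JSON-file-per-run-id layout, the names `save_checkpoint`, `load_checkpoint`, and `run_agent`, and the `crash_before` hook used to simulate the restart are all assumptions made for the sketch.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(directory: Path, run_id: str, state: dict) -> None:
    """Write state to <run_id>.json via a temp file and an atomic rename."""
    directory.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    # os.replace is atomic on the same filesystem, so a reader never
    # sees a half-written checkpoint.
    os.replace(tmp_name, directory / f"{run_id}.json")


def load_checkpoint(directory: Path, run_id: str) -> dict:
    """Return the saved state for run_id, or a fresh state if none exists."""
    path = directory / f"{run_id}.json"
    if path.exists():
        return json.loads(path.read_text())
    return {"next_step": 0, "results": []}


def run_agent(directory: Path, run_id: str, steps, crash_before=None):
    """Run steps in order, checkpointing after each one.

    crash_before is a hypothetical test hook: it raises before the given
    step index to imitate a worker restart.
    """
    state = load_checkpoint(directory, run_id)
    while state["next_step"] < len(steps):
        index = state["next_step"]
        if crash_before is not None and index == crash_before:
            raise RuntimeError("simulated worker restart")
        state["results"].append(steps[index]())
        state["next_step"] = index + 1
        save_checkpoint(directory, run_id, state)
    return state["results"]
```

Calling `run_agent` again with the same run id after the simulated restart picks up at the first step that has no checkpoint. A step that finishes but crashes before its checkpoint is written would still run twice in this sketch, which is the gap that idempotency keys are meant to close.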

Solution

Therefore:

Two approaches are used in production.

  • (a) Deterministic replay of recorded effects (the Temporal/Inngest pattern). State is the inputs plus a log of side-effects. On resume, the engine re-executes the workflow code and skips every side-effect that already has a logged result, returning the logged result instead.
  • (b) Checkpoint snapshots of agent state (the LangGraph Cloud pattern). The runtime periodically serialises the plan, working memory, partial outputs, and pending tool calls, and restores them on restart.

Both approaches require deterministic idempotency keys passed to side-effect targets. If the process crashes after an effect runs but before its result is logged, the resumed run issues the call again; the key lets the downstream system recognise and deduplicate it. Without the key, that crash window produces duplicates.
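The interaction between replay and idempotency keys can be sketched in a few lines. This is a toy model under stated assumptions, not the API of Temporal, Inngest, or any other engine: `ReplayContext`, `EmailService`, and `workflow` are hypothetical names, the effect log is an in-memory dict standing in for durable storage, and the key is derived from run id, call position, and effect name.

```python
import hashlib


class EmailService:
    """Stand-in for a downstream target that deduplicates on idempotency key."""

    def __init__(self):
        self.calls = 0   # how many times send() was invoked
        self.sent = {}   # idempotency key -> recipient actually delivered

    def send(self, key: str, to: str) -> str:
        self.calls += 1
        if key not in self.sent:
            self.sent[key] = to
        return f"sent:{to}"


class ReplayContext:
    """Replays workflow code, skipping effects that already have logged results."""

    def __init__(self, run_id: str, effect_log: dict):
        self.run_id = run_id
        self.log = effect_log  # assumed durable in a real system
        self.position = 0

    def effect(self, name: str, fn):
        # Deterministic key: same run, same call position, same effect name.
        raw = f"{self.run_id}:{self.position}:{name}".encode()
        key = hashlib.sha256(raw).hexdigest()
        self.position += 1
        if key in self.log:
            return self.log[key]   # already ran: return the recorded result
        result = fn(key)           # run the side-effect, passing the key along
        self.log[key] = result     # a crash before this line leaves it unlogged
        return result


def workflow(ctx: ReplayContext, mailer: EmailService, recipients):
    """Workflow code: must issue the same effects in the same order on replay."""
    return [
        ctx.effect("send_email", lambda key, to=to: mailer.send(key, to))
        for to in recipients
    ]
```

If the last log entry is deleted to imitate a crash between the effect and the log write, replaying `workflow` with a fresh `ReplayContext` over the same log calls `mailer.send` once more, but the repeated key means no second delivery is recorded.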

What this pattern forbids. Agent state must be serialisable; non-serialisable in-memory references are forbidden in long-running paths.
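One way to enforce this rule is to route every checkpoint through a serialiser that rejects anything it cannot encode. The sketch below is an assumption-laden illustration: the `AgentState` fields and the `to_checkpoint` and `from_checkpoint` helpers are hypothetical, and JSON is chosen only because it fails loudly on in-memory references such as callbacks.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AgentState:
    """Hypothetical agent state restricted to JSON-compatible values."""

    run_id: str
    next_step: int = 0
    pending_tool_calls: list = field(default_factory=list)


def to_checkpoint(state: AgentState) -> str:
    """Serialise the state; raises TypeError on non-serialisable members."""
    return json.dumps(asdict(state))


def from_checkpoint(blob: str) -> AgentState:
    """Rebuild the state from a checkpoint produced by to_checkpoint."""
    return AgentState(**json.loads(blob))
```

A state that holds a pending call as plain data, such as `{"tool": "fetch", "args": {"url": "https://example.com"}}`, round-trips unchanged, while a state that holds a Python callback in `pending_tool_calls` fails with `TypeError` at checkpoint time rather than silently at resume time.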

The smaller patterns that complete this one —

  • uses: Short-Term Thread Memory ★★: Carry the relevant slice of conversation context across turns within a session.
  • generalises: Durable Workflow Snapshot: Capture workflow execution state as a snapshot in a pluggable storage provider so a paused run can resume across deployments, process restarts, and host crashes.

And the patterns that stand alongside it, or against it —

  • complements: Scheduled Agent ★★: Run the agent on a fixed schedule independent of user requests.
  • complements: Event-Driven Agent ★★: Trigger the agent on external events (webhooks, message queues, file changes) instead of user requests or schedules.
  • complements: Todo-List-Driven Autonomous Agent: Have the agent author a plan file (e.g. todo.md) early in the run, tick items as it completes them, and re-inject the remaining plan into context; the file is both durable plan and working memory.
  • complements: Interrupt-Resumable Thought ·: Preserve multi-step reasoning across interrupts by supporting paused-and-resumed thought frames so a new message is handled cleanly without clobbering in-flight work.
  • complements: Partial-Output Salvage: Stream every model token to a tmp-plus-atomic-replace partial file so crashes mid-inference leave a consistent salvage, then promote partials at startup with a typed recovery marker the model can see.
  • complements: Blocking Sync Calls in Agent Loop: Anti-pattern that runs synchronous, blocking I/O inside the agent loop or HTTP handler, capping concurrency at the number of OS threads.
  • complements: Stateless Reducer Agent: Design the agent as a pure function (state, event) → newState; the entire execution history is held in an external event log, which enables pause / resume / replay / time-travel without bespoke checkpointing.
  • complements: Test-Time Memorization (Titans) ·: Memory module that learns at inference time by incorporating recent inputs into its parameters during the session rather than relying solely on pre-trained weights.
