Durable Workflow Snapshot

also known as Workflow Checkpointing, Storage-Backed Workflow State, Snapshot Persistence

Capture workflow execution state as a snapshot in a pluggable storage provider so a paused run can resume across deployments, process restarts, and host crashes.

This pattern helps complete certain larger patterns —

specialises Agent Resumption ★★ - Persist agent execution state so a long-running run survives restarts, deploys, or user disconnects.
used-by Interruptible Agent Execution ★ - Treat pause, resume, and cancel as a first-class control surface on every long-running agent so users can halt expensive or off-track trajectories mid-task while state is preserved for resumption.

Context

A team builds workflows that may run for hours or days and that frequently pause waiting on external signals: a human approving a loan, a slow third-party API returning a result, or a scheduled wake-up the next morning. These workflows have to keep running across application deploys, restarts of the worker processes, and the loss of individual hosts. The team has access to durable storage such as a Postgres database, an object store, or a vendor-managed snapshot service.

Problem

Keeping workflow state only in process memory is enough to survive an error that the same process catches and recovers from, but not deploys that replace the binary, host failures that move work elsewhere, or pauses long enough that the original worker is gone. Unless the full state is written to durable storage at known checkpoints, every deploy or host loss destroys in-flight runs and the work restarts from zero. The team must choose between short workflows that fit in one process lifetime and accepting that long-running workflows will routinely lose hours of progress.

Forces

Workflow state grows with run length and must be serialisable to durable storage.
Storage providers vary in latency, cost, and consistency guarantees.
Snapshot schemas change across deployments: a v1 snapshot may need to resume under v2 code.
Snapshot frequency trades resume granularity against write cost.
Snapshots are sensitive data; access control on the storage provider is part of the threat model.

Example

A loan-origination agent runs for hours, pausing twice for human approval. Without durable snapshots, every nightly deploy kills in-flight runs and the work restarts from zero the next morning. The team adds durable workflow snapshots written to Postgres after each step: on deploy, in-flight runs resume from their last checkpoint, the awaited approval is rehydrated, and the worst-case loss is one step. Snapshot schemas are versioned; the new deploy refuses to resume a snapshot it cannot understand and emits an explicit recovery task instead.
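The scenario above can be sketched roughly as follows. This is a minimal illustration, not the catalog's or any framework's implementation: `sqlite3` from the Python standard library stands in for Postgres so the example runs anywhere, and the table layout, `SCHEMA_VERSION`, and the functions `open_store`, `save_snapshot`, and `resume_or_recover` are all hypothetical names chosen for this sketch.

```python
import json
import sqlite3

SCHEMA_VERSION = 2  # hypothetical: the snapshot schema this deploy understands


def open_store(path=":memory:"):
    """Open the snapshot table. sqlite3 is a stand-in for Postgres here."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots ("
        "run_id TEXT PRIMARY KEY, "
        "schema_version INTEGER NOT NULL, "
        "body TEXT NOT NULL)"
    )
    return conn


def save_snapshot(conn, run_id, state):
    """Overwrite the run's snapshot after a step, inside one transaction."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            # INSERT OR REPLACE is SQLite syntax; Postgres would use ON CONFLICT.
            "INSERT OR REPLACE INTO snapshots (run_id, schema_version, body) "
            "VALUES (?, ?, ?)",
            (run_id, SCHEMA_VERSION, json.dumps(state)),
        )


def resume_or_recover(conn, run_id):
    """Return ("resume", state), ("recovery_task", info), or ("start", None)."""
    row = conn.execute(
        "SELECT schema_version, body FROM snapshots WHERE run_id = ?", (run_id,)
    ).fetchone()
    if row is None:
        return ("start", None)
    version, body = row
    if version != SCHEMA_VERSION:
        # Do not guess at an unknown layout; surface an explicit recovery task.
        return ("recovery_task", {"run_id": run_id, "found_version": version})
    return ("resume", json.loads(body))


conn = open_store()
save_snapshot(
    conn,
    "loan-42",
    {"step_index": 3, "awaited_signals": ["underwriter_approval"]},
)
outcome, state = resume_or_recover(conn, "loan-42")
```

Storing the schema version in its own column, rather than only inside the serialised body, lets the new deploy decide whether to resume without first parsing a layout it may not understand.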

Diagram

sequenceDiagram
    participant W as Workflow Engine
    participant S as Storage Provider
    W->>W: run step 1
    W->>S: snapshot(state_1)
    W->>W: run step 2
    W->>S: snapshot(state_2)
    Note over W: host crashes / deploy
    W->>S: load(run_id)
    S-->>W: state_2
    W->>W: resume from step 3

Solution

Therefore:

Treat the workflow runtime as a state machine whose state is fully serialisable. At checkpoints (after every step, on suspend, before risky actions) write a snapshot — `{step_index, local_state, awaited_signals, history}` — to a pluggable storage provider (Postgres, S3, Redis, vendor-managed). To resume, load the snapshot, rehydrate state, and continue from the recorded step. Version snapshot schemas; refuse to resume incompatible versions rather than corrupt the run. Pair with agent-resumption (the broader pattern), replay-time-travel (the auditor view), and provenance-ledger (linking snapshots to outputs).
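One way to picture the checkpoint-and-resume loop is the sketch below. It is an assumption-laden illustration rather than a reference implementation: the `Snapshot` fields mirror the shape named above, while `StorageProvider`, `JsonFileProvider`, `run`, and the `crash_before` parameter are hypothetical names, and a directory of JSON files stands in for Postgres, S3, or a vendor-managed store.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional, Protocol


@dataclass
class Snapshot:
    step_index: int = 0  # index of the next step to run
    local_state: dict = field(default_factory=dict)
    awaited_signals: list = field(default_factory=list)
    history: list = field(default_factory=list)


class StorageProvider(Protocol):
    """The pluggable seam: any backend offering save/load can be used."""

    def save(self, run_id: str, snap: Snapshot) -> None: ...
    def load(self, run_id: str) -> Optional[Snapshot]: ...


class JsonFileProvider:
    """One JSON file per run; a stand-in for a real durable store."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def save(self, run_id: str, snap: Snapshot) -> None:
        tmp = self.root / f"{run_id}.json.tmp"
        tmp.write_text(json.dumps(asdict(snap)))
        # Rename over the old file so a reader never sees a half-written snapshot.
        tmp.replace(self.root / f"{run_id}.json")

    def load(self, run_id: str) -> Optional[Snapshot]:
        path = self.root / f"{run_id}.json"
        if not path.exists():
            return None
        return Snapshot(**json.loads(path.read_text()))


def run(run_id, steps, store: StorageProvider, crash_before=None):
    """Run steps from the last checkpoint, snapshotting after each one."""
    snap = store.load(run_id) or Snapshot()
    while snap.step_index < len(steps):
        if snap.step_index == crash_before:
            raise RuntimeError("simulated host crash")
        step = steps[snap.step_index]
        step(snap.local_state)
        snap.history.append(step.__name__)
        snap.step_index += 1
        store.save(run_id, snap)  # checkpoint after every step
    return snap


def collect(state): state["docs"] = "received"
def score(state): state["score"] = 710
def decide(state): state["approved"] = state["score"] >= 700


store = JsonFileProvider(Path(tempfile.mkdtemp()))
steps = [collect, score, decide]
try:
    run("loan-42", steps, store, crash_before=2)  # dies before the third step
except RuntimeError:
    pass
final = run("loan-42", steps, store)  # a later call resumes from the snapshot
```

The second call to `run` shares nothing in memory with the first; everything it needs comes back through `store.load`, which is the property the pattern relies on.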

What this pattern forbids. Workflow state must be fully serialisable into the storage provider at every checkpoint; no in-process-only data may participate in resumption, and snapshots are not allowed to resume under incompatible schema versions.
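Both prohibitions can be enforced at the serialisation boundary. The sketch below is one possible shape, assuming JSON as the snapshot encoding; `SUPPORTED_SCHEMA_VERSIONS`, `NotSerialisable`, `IncompatibleSnapshot`, `encode_snapshot`, and `decode_snapshot` are hypothetical names introduced only for illustration.

```python
import json

SUPPORTED_SCHEMA_VERSIONS = {2}  # hypothetical: versions this deploy can resume


class NotSerialisable(TypeError):
    """Workflow state contains a value that cannot be written to storage."""


class IncompatibleSnapshot(RuntimeError):
    """Snapshot was written under a schema this code does not understand."""


def encode_snapshot(state: dict, schema_version: int = 2) -> str:
    """Serialise state, failing at checkpoint time rather than at resume time."""
    try:
        return json.dumps({"schema_version": schema_version, "state": state})
    except TypeError as exc:
        # json.dumps raises TypeError for locks, sockets, open files, and so on.
        raise NotSerialisable(
            f"in-process-only value in workflow state: {exc}"
        ) from exc


def decode_snapshot(raw: str) -> dict:
    """Return the stored state, or refuse if the schema version is unsupported."""
    doc = json.loads(raw)
    version = doc.get("schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise IncompatibleSnapshot(f"cannot resume schema version {version!r}")
    return doc["state"]
```

Raising at `encode_snapshot` means a step that smuggles a lock or an open connection into state fails at its own checkpoint, while the previous snapshot is still intact, instead of surfacing hours later as an unrecoverable resume.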

And the patterns that stand alongside it, or against it —

complements Replay / Time-Travel ★★ - Re-run a past agent trace from any step with modified inputs/prompts/tools to debug or branch.
complements Provenance Ledger ★★ - Log every agent decision and state change with enough metadata to explain or reverse it later.
complements Scheduled Agent ★★ - Run the agent on a fixed schedule independent of user requests.
complements Blocking Sync Calls in Agent Loop ✕ - Anti-pattern: run synchronous, blocking I/O inside the agent loop or HTTP handler, capping concurrency at the number of OS threads.
complements Missing Idempotency on Agent Calls ✕ - Anti-pattern: retry state-mutating agent tool calls without idempotency keys, so retries multiply real-world side effects.
complements Orchestrator as Bottleneck ✕ - Anti-pattern: route all agent runs through a single-process orchestrator that becomes the system-wide concurrency ceiling.
complements Stateless Reducer Agent ★ - Design the agent as a pure function (state, event) → newState; entire execution history is held in an external event log; enables pause / resume / replay / time-travel without bespoke checkpointing.
complements Journaled LLM Call ★ - Record the output of every non-deterministic step on first execution and replay that recorded value during crash-recovery instead of re-invoking the model.
alternative-to Shadow Workspace ★ - Mirror the workspace into an isolated, version-controlled shadow where the agent makes and reverts edits, surfacing diffs for review and promoting only accepted changes to the real tree.

Used in recipes

Agent Runtime Cross-Cutting (optional)

Provenance

Source: patterns/durable-workflow-snapshot.md on GitHub · commit 7965435
Added to catalog: 2026-05-20
Last updated: 2026-05-21
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.