X · Governance & Observability · Emerging

Durable Workflow Snapshot

also known as Workflow Checkpointing, Storage-Backed Workflow State, Snapshot Persistence

Capture workflow execution state as a snapshot in a pluggable storage provider so a paused run can resume across deployments, process restarts, and host crashes.

This pattern helps complete certain larger patterns —

  • specialises Agent Resumption ★★: Persist agent execution state so a long-running run survives restarts, deploys, or user disconnects.
  • used-by Interruptible Agent Execution: Treat pause, resume, and cancel as a first-class control surface on every long-running agent so users can halt expensive or off-track trajectories mid-task while state is preserved for resumption.

Context

A team builds workflows that may run for hours or days and that frequently pause waiting on external signals: a human approving a loan, a slow third-party API returning a result, or a scheduled wake-up the next morning. These workflows have to keep running across application deploys, restarts of the worker processes, and the loss of individual hosts. The team has access to durable storage such as a Postgres database, an object store, or a vendor-managed snapshot service.

Problem

Keeping the workflow state only in process memory is enough to survive a transient failure that the same process recovers from, but not deploys that replace the binary, host failures that move work elsewhere, or pauses long enough that the original worker is gone. Unless the full state is written to durable storage at known checkpoints, every deploy or host loss destroys in-flight runs and the work restarts from zero. The team is forced to choose between short workflows that fit in one process lifetime and long-running workflows that routinely lose hours of progress.

Forces

  • Workflow state grows with run length and must be serialisable to durable storage.
  • Storage providers vary in latency, cost, and consistency guarantees.
  • Snapshot schemas must be versioned across deployments: a v1 snapshot may need to resume under v2 code.
  • Snapshot frequency trades resume granularity against write cost.
  • Snapshots are sensitive data; access control on the storage provider is part of the threat model.
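
The schema-versioning force can be made concrete with a small migration chain. The following is a minimal sketch, not a prescribed implementation: the field names, the `MIGRATIONS` registry, and the v1-to-v2 change (renaming `state` to `local_state`) are all hypothetical, invented here for illustration. A snapshot is upgraded one version at a time, and anything without a known upgrade path is refused.

```python
from typing import Callable

CURRENT_VERSION = 2  # hypothetical: the schema version the running code understands


class IncompatibleSnapshot(Exception):
    """Raised when a snapshot cannot be brought up to CURRENT_VERSION."""


def _v1_to_v2(snap: dict) -> dict:
    # Hypothetical change: v2 renamed "state" to "local_state"
    # and introduced an "awaited_signals" list.
    upgraded = {k: v for k, v in snap.items() if k != "state"}
    upgraded["local_state"] = snap["state"]
    upgraded["awaited_signals"] = []
    upgraded["schema_version"] = 2
    return upgraded


# Maps a source version to the function that upgrades it by one version.
MIGRATIONS: dict[int, Callable[[dict], dict]] = {1: _v1_to_v2}


def upgrade(snap: dict) -> dict:
    """Return a copy of `snap` at CURRENT_VERSION, or raise IncompatibleSnapshot."""
    version = snap.get("schema_version")
    if not isinstance(version, int) or version > CURRENT_VERSION:
        raise IncompatibleSnapshot(f"unsupported schema_version: {version!r}")
    while version < CURRENT_VERSION:
        step = MIGRATIONS.get(version)
        if step is None:
            raise IncompatibleSnapshot(f"no migration from version {version}")
        snap = step(snap)  # each step builds a new dict; the input is not mutated
        version = snap["schema_version"]
    return snap
```

Upgrading one version at a time keeps each migration small and lets an old snapshot pass through several deploys' worth of changes; whether to migrate or simply refuse is a policy choice for the team.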

Example

A loan-origination agent runs for hours, pausing twice for human approval. Without durable snapshots, every nightly deploy kills in-flight runs and the work restarts from zero the next morning. The team adds durable workflow snapshots written to Postgres after each step: on deploy, in-flight runs resume from their last checkpoint, the awaited approval is rehydrated, and the worst-case loss is one step. Snapshot schemas are versioned; the new deploy refuses to resume a snapshot it cannot understand and emits an explicit recovery task instead.

Solution

Therefore:

Treat the workflow runtime as a state machine whose state is fully serialisable. At checkpoints (after every step, on suspend, before risky actions) write a snapshot of the form `{step_index, local_state, awaited_signals, history}` to a pluggable storage provider (Postgres, S3, Redis, vendor-managed). To resume, load the snapshot, rehydrate state, and continue from the recorded step. Version snapshot schemas; refuse to resume incompatible versions rather than corrupt the run. Pair with Agent Resumption (the broader pattern), Replay / Time-Travel (the auditor view), and Provenance Ledger (linking snapshots to outputs).
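
One way to picture the pluggable provider is as a two-method interface that any backend can satisfy. The sketch below is illustrative only: `SnapshotStore`, `DictStore`, `checkpoint`, `resume`, and the `run_id` and `schema_version` fields are assumed names, not part of any particular framework, and the in-memory store is a stand-in for a durable backend.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional, Protocol

SCHEMA_VERSION = 1  # hypothetical version understood by this code


@dataclass
class Snapshot:
    run_id: str
    step_index: int
    local_state: dict[str, Any]
    awaited_signals: list[str] = field(default_factory=list)
    history: list[str] = field(default_factory=list)
    schema_version: int = SCHEMA_VERSION


class SnapshotStore(Protocol):
    """Anything with these two methods can act as the storage provider."""

    def put(self, run_id: str, payload: str) -> None: ...
    def get(self, run_id: str) -> Optional[str]: ...


class DictStore:
    """In-memory stand-in for demonstration only; it is NOT durable."""

    def __init__(self) -> None:
        self._rows: dict[str, str] = {}

    def put(self, run_id: str, payload: str) -> None:
        self._rows[run_id] = payload

    def get(self, run_id: str) -> Optional[str]:
        return self._rows.get(run_id)


class IncompatibleSnapshot(Exception):
    pass


def checkpoint(store: SnapshotStore, snap: Snapshot) -> None:
    # Serialise the whole snapshot; json.dumps raises if anything is not serialisable.
    store.put(snap.run_id, json.dumps(asdict(snap)))


def resume(store: SnapshotStore, run_id: str) -> Optional[Snapshot]:
    payload = store.get(run_id)
    if payload is None:
        return None  # nothing to resume
    data = json.loads(payload)
    if data.get("schema_version") != SCHEMA_VERSION:
        # Refuse rather than risk corrupting the run.
        raise IncompatibleSnapshot(f"schema_version {data.get('schema_version')!r}")
    return Snapshot(**data)
```

Because the runtime depends only on `put` and `get`, swapping the in-memory store for a Postgres, S3, or Redis adapter would not change `checkpoint` or `resume`.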

What this pattern forbids. Workflow state must be fully serialisable into the storage provider at every checkpoint; no in-process-only data may participate in resumption, and snapshots are not allowed to resume under incompatible schema versions.
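
The serialisability constraint can be enforced mechanically at each checkpoint. As a hedged sketch, assuming JSON as the wire format (the pattern itself does not mandate one) and using the hypothetical names `assert_serialisable` and `NotSerialisable`, the guard below rejects state that either fails to encode or changes shape on the way back.

```python
import json
from typing import Any


class NotSerialisable(TypeError):
    """State contains data that cannot take part in resumption."""


def assert_serialisable(state: dict[str, Any]) -> str:
    """Return the JSON payload for `state`, or raise NotSerialisable."""
    try:
        # allow_nan=False rejects NaN/Infinity, which are not valid JSON.
        payload = json.dumps(state, allow_nan=False)
    except (TypeError, ValueError) as exc:
        # Typical offenders: callables, sets, open handles, circular references.
        raise NotSerialisable(str(exc)) from exc
    if json.loads(payload) != state:
        # Encodes, but comes back different (tuples become lists,
        # integer keys become strings), so resume would see other data.
        raise NotSerialisable("state does not survive a JSON round trip")
    return payload
```

Running a guard like this before every write surfaces in-process-only data at the checkpoint that introduced it, rather than at resume time on a different host.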

And the patterns that stand alongside it, or against it —

  • complements Replay / Time-Travel ★★: Re-run a past agent trace from any step with modified inputs/prompts/tools to debug or branch.
  • complements Provenance Ledger ★★: Log every agent decision and state change with enough metadata to explain or reverse it later.
  • complements Scheduled Agent ★★: Run the agent on a fixed schedule independent of user requests.
  • complements Blocking Sync Calls in Agent Loop: Anti-pattern: run synchronous, blocking I/O inside the agent loop or HTTP handler, capping concurrency at the number of OS threads.
  • complements Missing Idempotency on Agent Calls: Anti-pattern: retry state-mutating agent tool calls without idempotency keys, so retries multiply real-world side effects.
  • complements Orchestrator as Bottleneck: Anti-pattern: route all agent runs through a single-process orchestrator that becomes the system-wide concurrency ceiling.
  • complements Stateless Reducer Agent: Design the agent as a pure function (state, event) → newState; entire execution history is held in an external event log; enables pause / resume / replay / time-travel without bespoke checkpointing.
