X · Governance & Observability · Emerging ★

Journaled LLM Call

Record the output of every non-deterministic step on first execution and replay that recorded value during crash-recovery instead of re-invoking the model.

Context

A team runs an agent on a durable-execution engine that survives crashes by replaying the workflow from a recorded history. The workflow drives non-deterministic steps: LLM calls, tool results, timestamps, and random draws. The engine reconstructs in-memory state after a restart by re-running the workflow code up to the point it died, then resuming. For that reconstruction to be correct, every step the workflow code re-executes during replay must yield the same value it produced on the original run.

Problem

An LLM call is not a pure function: the same prompt returns a different completion on the next invocation, and a timestamp or random draw changes every time. If the durable engine re-invokes the model during replay, the recovered run diverges from the original — a tool gets called with arguments the first run never produced, a branch is taken that never happened, and the workflow history no longer matches reality. This replay divergence corrupts state silently and is hard to detect, because each individual call looks valid. Re-invoking also pays the model cost and latency a second time for work that already completed.

Forces

Replay correctness demands that re-executed steps return identical values, but model calls are non-deterministic by construction.
A recorded LLM response may be stale relative to the world, yet a fresh response breaks determinism.
Re-invoking the model on every recovery pays the token cost and latency again, once per recovery, for work already done.
Journaling adds storage and a write on the hot path of each non-deterministic step.

Example

A durable research agent calls an LLM to pick a search query, runs the search as a tool, then summarises the results. The worker crashes after the search tool call returns but before the summary step is persisted. On recovery the engine replays the workflow: without journaling it would re-prompt the model, get a different query, and summarise the wrong results. With journaled calls the engine returns the original query and search result from the journal, recomputes only the deterministic control flow, and resumes exactly where it left off.

Diagram

flowchart TD
    Step[Non-deterministic step<br/>LLM / tool / time / random] --> Q{Journal entry<br/>for this step?}
    Q -- No, first run --> Inv[Invoke model/resource] --> Rec[Append output to journal] --> Use[Use output]
    Q -- Yes, replay --> Read[Read journaled output] --> Use
    Use --> Det[Deterministic workflow logic] --> Next[Next step]

Solution

Therefore:

Classify every step as deterministic workflow logic or non-deterministic effect. Run each effect — LLM call, tool invocation, timestamp read, random draw — exactly once and append its result to an append-only journal keyed by step position. On crash-recovery the engine replays the workflow code from the start; deterministic logic recomputes freely, but each effect call short-circuits to its journaled output instead of re-invoking the underlying resource. The model is queried only the first time a given step is reached; thereafter the recorded response stands in for it. This trades a possibly-stale recorded answer for deterministic, fault-tolerant replay and avoids paying the call cost twice.
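One way to picture the mechanism is a small file-backed journal. The sketch below rests on assumptions that are not part of the pattern text: the `Journal` class, its `effect` method, the JSON-lines file layout, and the `ReplayMismatch` check on the step name are all hypothetical illustrations, and `call_model` is a fake model. Values must be JSON-serialisable here; a production engine would use its own history store.

```python
import json
import os
import random
import tempfile
import time
from pathlib import Path
from typing import Any, Callable, Dict, List


class ReplayMismatch(RuntimeError):
    """The workflow reached a different effect than the one journaled at this position."""


class Journal:
    """Append-only journal of effect outputs, keyed by step position (hypothetical)."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.entries: List[Dict[str, Any]] = []
        if path.exists():
            with path.open(encoding="utf-8") as f:
                self.entries = [json.loads(line) for line in f if line.strip()]
        self.position = 0  # index of the next effect the workflow will reach

    def effect(self, name: str, fn: Callable[[], Any]) -> Any:
        if self.position < len(self.entries):
            # Replay: short-circuit to the recorded output, never call fn.
            entry = self.entries[self.position]
            if entry["name"] != name:
                raise ReplayMismatch(
                    f"step {self.position}: journal has {entry['name']!r}, code asked for {name!r}"
                )
            value = entry["value"]
        else:
            # First execution: run the effect once and persist it before use.
            value = fn()
            entry = {"step": self.position, "name": name, "value": value}
            with self.path.open("a", encoding="utf-8") as f:
                f.write(json.dumps(entry) + "\n")
                f.flush()
                os.fsync(f.fileno())
            self.entries.append(entry)
        self.position += 1
        return value


def workflow(journal: Journal, call_model: Callable[[str], str]) -> Dict[str, Any]:
    # Every non-deterministic read goes through the journal, including clock and RNG.
    started = journal.effect("clock", time.time)
    draw = journal.effect("rng", random.random)
    answer = journal.effect("llm", lambda: call_model("pick a search query"))
    return {"started": started, "draw": draw, "answer": answer}


calls: List[str] = []


def call_model(prompt: str) -> str:
    # Fake model that records each invocation so re-invocation is observable.
    calls.append(prompt)
    return f"completion #{len(calls)}"


path = Path(tempfile.mkdtemp()) / "journal.jsonl"
first = workflow(Journal(path), call_model)       # original run
recovered = workflow(Journal(path), call_model)   # replay, as if in a new process
```

Here `recovered` equals `first`, and `calls` holds a single prompt, because the second run read the timestamp, the random draw, and the completion from the file. Checking the step name on replay is an added safeguard in this sketch: it turns a workflow-code change that reorders effects into an explicit error rather than a silently mismatched value.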

What this pattern forbids. On replay the workflow must not re-invoke the model, clock, or RNG; the journaled output is replayed in place of a fresh call.

And the patterns that stand alongside it, or against it —

complements: Durable Workflow Snapshot ★ — Capture workflow execution state as a snapshot in a pluggable storage provider so a paused run can resume across deployments, process restarts, and host crashes.
complements: Agent Resumption ★★ — Persist agent execution state so a long-running run survives restarts, deploys, or user disconnects.
alternative-to: Replay / Time-Travel ★★ — Re-run a past agent trace from any step with modified inputs/prompts/tools to debug or branch.
complements: Determinism-Tiered Replay Gate · — Classify an agent into a reproducibility tier by re-running identical inputs, require the strictest decision-determinism tier for regulated decisions, and gate deployment and validation-sample size on the measured tier.
complements: Replay Divergence ✕ — Anti-pattern: treat an append-only event log whose consumers are LLMs as deterministically replayable, so replaying it under a changed model or prompt reconstructs different downstream events than the original run.