Agentic Context Engineering Playbook
also known as ACE, Delta-Patched Playbook, Generator-Reflector-Curator Triad, Item-Addressable Self-Improvement
Treat the agent's system prompt and long-lived memory as a structured, item-addressable playbook that evolves through small delta updates from a Generator/Reflector/Curator loop, so accumulated tactics resist the context collapse that monolithic rewrites cause.
This pattern helps complete certain larger patterns —
- specialisesReflexion·— Have the agent write linguistic lessons from past failures and consult them in future episodes.
- used-byRigor Relocation★— Relocate verification rigor from the model loop to surrounding scaffolding (evals, judges, decision logs, policy gates) so failures are caught by the wrapper rather than the agent.
Context
A team operates an agent whose behaviour is shaped by a long-lived system prompt or a persistent memory file, and that prompt accumulates tactics, heuristics, and worked examples gathered across many runs over weeks or months. After every batch of tasks the team wants the agent to absorb what it learned, so they periodically ask the agent to reflect on its own runs and update the playbook in place. Each update needs to add new specific tactics without eroding the ones already there.
Problem
When self-reflection is free-form and the agent is asked to rewrite the whole playbook in one pass, each rewrite tends to paraphrase yesterday's concrete tactic into a vague generality and then drop it on the next pass. There is no addressable unit a reflection step can point at, so the playbook either bloats with near-duplicates or collapses into platitudes. Three different jobs (proposing a new lesson, judging whether it is correct, and deciding whether to keep it) all happen inside the same prompt, which produces vague output because the model cannot do all three jobs well at once. The team is forced to choose between losing accumulated specifics and letting the playbook grow unbounded.
Forces
- Playbooks must accumulate specific tactics, not just abstract principles, to remain useful.
- Monolithic rewrites lose item-level structure and tend toward generic phrasing each pass (context collapse).
- Some items are wrong, redundant, or stale and must be removable without disturbing the rest.
- Generation, evaluation, and curation are different jobs; collapsing them into one prompt produces vague output.
- The playbook must remain readable and auditable by humans, not become an opaque blob.
Example
A coding agent accumulates a playbook of testing tactics over months of runs. The team switches from whole-prompt rewrites to a three-role loop. After each task, the Generator proposes new items like 'before running pytest in this repo, install dev extras'; the Reflector compares the proposal against the run outcome and against existing items; the Curator adds it as item 47, edits item 12 (which was a vaguer version of the same tactic), and removes item 33 (which the Reflector flagged as wrong in two recent runs). The playbook keeps growing in specificity instead of decaying into generalities.
Diagram
Solution
Therefore:
The playbook is stored as an ordered list of items with stable identifiers; each item carries a short tactic, optional worked example, and provenance. A run produces a trajectory and outcome. The Generator reads the trajectory and proposes new candidate items as deltas. The Reflector reviews proposed and existing items against the outcome and recent history, scoring which to keep, edit, or drop. The Curator applies the resulting delta set — strictly add/edit/remove operations against item ids — with dedup against existing items. Whole-playbook rewrites are forbidden. The three roles are separate prompts (and may be separate model calls) so that generation cannot pre-empt evaluation, and evaluation cannot quietly drop items the Curator did not authorise.
What this pattern forbids. The Generator must only emit candidate item deltas, never rewrite the playbook; the Reflector must only score items, never edit them; the Curator must apply only add/edit/remove operations against existing item ids and must never replace the playbook wholesale; whole-prompt regeneration of the playbook is forbidden.
And the patterns that stand alongside it, or against it —
- alternative-toSelf-Refine★★— Iterate generate → feedback (same model) → refine until a stop criterion fires, with no separate critic model.
- complementsPrompt Versioning★★— Treat prompts as immutable, hashed, semver'd artefacts in a registry; deploy and roll back like code.
- complementsCluster-Capped Insight Store·— Cap the number of insights per stem-token cluster and archive the oldest variants by mtime so the long-term store keeps the active research edge instead of accumulating near-duplicates.
- alternative-toDSPy Signatures★— Specify agent behaviour as declarative typed signatures and modules; compile prompts and few-shot examples automatically against a metric.
- complementsPre-Flight Spec Authoring★— Before any code is generated, author a multi-pillar spec and have the agent critique it for ambiguity and edge cases, so that the loop executes against a reviewed target rather than a fresh prompt.
- complementsContext Window Dumb-Zone Cap★— Hold context-window utilization below a working threshold (~40%) to keep the model out of the 'dumb zone' where it begins ignoring earlier instructions and hallucinating.
Neighbourhood
Click any neighbour to follow the language. Scroll to zoom, drag to pan.