Methodology · Iteration Management · proven · verified

Crawl-Walk-Run Automation Gating

also known as progressive autonomy, autonomy tiers

Applies to: agent, coding-agent, multi-agent-system

Tags: autonomy-tiers, progressive-rollout, gating

Roll an agent out in three stages (Crawl, Walk, Run), with a clear gate between each one. In Crawl the agent only suggests, and a person acts. In Walk the agent's actions reach internal staff, who can fix mistakes. In Run the agent acts directly on outside customers. Each stage is set per action type, not for the whole agent: the same agent can be in Run for safe read-only actions and in Crawl for refunds. To move up a stage, an action type must clear a published metric bar. If its numbers drop below that bar, it is demoted automatically.
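A minimal sketch of what per-action-type gating could look like at dispatch time. The action-type keys, the mode labels, and the `dispatch` function are hypothetical names chosen for illustration; the methodology does not prescribe them.

```python
from typing import Dict, Tuple

# Hypothetical tier map: one entry per action type, not one per agent.
TIERS: Dict[str, str] = {
    "lookup_order_status": "Run",  # safe read-only action
    "issue_refund": "Crawl",       # high-risk action
}


def dispatch(action_type: str, payload: str, tiers: Dict[str, str]) -> Tuple[str, str]:
    """Decide how far the agent may go for this one action type."""
    # Assumption: unknown action types fall back to the most restrictive tier.
    tier = tiers.get(action_type, "Crawl")
    if tier == "Crawl":
        return ("suggest_to_human", payload)      # a person accepts, rejects, or edits
    if tier == "Walk":
        return ("act_for_internal_staff", payload)  # staff can correct mistakes
    return ("act_for_customer", payload)          # Run: the customer sees the result
```

Defaulting unknown action types to Crawl is a design choice of this sketch; it keeps a newly added action from acting on customers before it has been cataloged.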

Methodology process overview

stateDiagram-v2
    [*] --> Crawl: every action type starts here
    Crawl --> Walk: clears acceptance-rate bar for hold duration
    Walk --> Run: clears customer-outcome bar for hold duration
    Walk --> Crawl: regression below Crawl-Walk bar (with hysteresis)
    Run --> Walk: regression below Walk-Run customer-outcome bar
    Run --> Crawl: severe regression / incident-driven demote
    note right of Crawl
        Agent emits suggestions only
        Human accepts / rejects / edits
    end note
    note right of Walk
        Agent acts on internal staff
        Internal staff can correct
    end note
    note right of Run
        Agent acts on external customers
        Customer-outcome metric monitored
    end note
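The diagram's transitions can be written down as a lookup table, so that any tier change not drawn in the diagram (for example Crawl straight to Run) is rejected. This is an illustrative sketch; the `Tier` enum and `is_allowed` helper are assumed names, and the reason strings paraphrase the diagram's edge labels.

```python
from enum import Enum


class Tier(Enum):
    CRAWL = "Crawl"
    WALK = "Walk"
    RUN = "Run"


# The five edges between tiers in the diagram, keyed by (from, to).
TRANSITIONS = {
    (Tier.CRAWL, Tier.WALK): "cleared acceptance-rate bar for hold duration",
    (Tier.WALK, Tier.RUN): "cleared customer-outcome bar for hold duration",
    (Tier.WALK, Tier.CRAWL): "regression below Crawl-Walk bar (with hysteresis)",
    (Tier.RUN, Tier.WALK): "regression below Walk-Run customer-outcome bar",
    (Tier.RUN, Tier.CRAWL): "severe regression or incident-driven demotion",
}


def is_allowed(src: Tier, dst: Tier) -> bool:
    """True only for tier changes that appear as an edge in the diagram."""
    return (src, dst) in TRANSITIONS
```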

Intent. Separate what an agent can do from what it is allowed to do on its own. A system that could plausibly act gets to act only after the data earns it, one action type at a time.

When to apply. Use this for any agent that could plausibly act on its own in ways customers feel. Examples are replying to tickets, refunding orders, sending outbound messages, or changing production resources. The right fit is when one bad action does real harm and there is no reliable way to undo it. Don't apply it for read-only or sandboxed agents, where a bad action causes no harm.

Example scenario

A mid-size retailer's customer-support team is deploying an AI agent that can answer questions, look up orders, and issue refunds. They enumerate four action types up front: 'reply to FAQ-style ticket', 'look up order status' (read-only), 'issue refund up to $50', and 'issue refund above $50'. They publish per-tier bars: Crawl-Walk requires 90% human acceptance over 200 actions in a two-week window; Walk-Run requires a customer-outcome metric (re-open rate within 7 days below 4%) over 500 actions in three weeks.

On day one every action type sits at Crawl. The agent drafts; humans accept, edit, or reject. After two weeks, FAQ replies hit 94% acceptance and lookup hits 99%, so both promote to Walk. Lookup also clears the customer-outcome bar within a further three weeks (re-open rate flat) and promotes to Run. FAQ replies sit at Walk longer because the re-open rate creeps up to 5% on edge cases involving order modifications; the action type stays at Walk pending fixes.

The refund actions move separately. Refund-up-to-$50 reaches Walk after a month. Refund-above-$50 stays at Crawl indefinitely by team policy, regardless of acceptance, because the blast radius justifies a permanent human in the loop.

Six weeks in, a prompt change inadvertently drops FAQ acceptance to 81%. The system auto-demotes FAQ to Crawl, the team rolls back the prompt, and a week later FAQ returns to Walk. Nobody frames the demotion as a failure: it is the gate doing exactly what it was designed to do.
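The two bars in this scenario can be checked mechanically. The sketch below encodes the published thresholds from the scenario; the `Window` type and function names are hypothetical, and the action counts used in the usage note are assumed, since the scenario reports rates but not counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    actions: int  # number of actions observed in the measurement window
    rate: float   # observed metric as a fraction (0.94 means 94%)


# Thresholds taken from the scenario's published bars.
CRAWL_WALK_MIN_ACCEPTANCE = 0.90
CRAWL_WALK_MIN_ACTIONS = 200
WALK_RUN_MAX_REOPEN_RATE = 0.04
WALK_RUN_MIN_ACTIONS = 500


def clears_crawl_walk(acceptance: Window) -> bool:
    """Human acceptance of at least 90% over at least 200 actions."""
    return (acceptance.actions >= CRAWL_WALK_MIN_ACTIONS
            and acceptance.rate >= CRAWL_WALK_MIN_ACCEPTANCE)


def clears_walk_run(reopen: Window) -> bool:
    """7-day re-open rate strictly below 4% over at least 500 actions."""
    return (reopen.actions >= WALK_RUN_MIN_ACTIONS
            and reopen.rate < WALK_RUN_MAX_REOPEN_RATE)
```

With an assumed 200 actions, `clears_crawl_walk(Window(200, 0.94))` is true for FAQ replies, while `clears_walk_run(Window(500, 0.05))` is false, which matches FAQ replies staying at Walk. A high rate on too few actions also fails, because the sample-size condition is part of the bar.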

Inputs

Catalog of action types — A list of every distinct action the agent can take. Each one has its own level of risk if it goes wrong.
Promotion bar per tier — Published targets an action type must hit to move up a stage. These cover human acceptance, internal completion, and customer outcomes.
Per-action-type metric pipeline — Logging that records, for every action, which stage it ran at and what happened next.
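As a rough illustration of the metric pipeline, the sketch below logs one record per action and derives a Crawl-stage acceptance rate per action type. The record fields and outcome labels ("accepted", "edited", "rejected") are assumptions made for this example, as is the choice to count only unedited acceptances; a team could reasonably count edits differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ActionRecord:
    action_type: str
    tier: str     # stage the action ran at: "Crawl", "Walk", or "Run"
    outcome: str  # hypothetical labels: "accepted", "edited", "rejected"


def acceptance_rate_by_type(records: Iterable[ActionRecord]) -> Dict[str, float]:
    """Share of Crawl-stage suggestions accepted without edits, per action type."""
    totals: Dict[str, int] = defaultdict(int)
    accepted: Dict[str, int] = defaultdict(int)
    for record in records:
        if record.tier != "Crawl":
            continue  # acceptance is only defined where a human reviews suggestions
        totals[record.action_type] += 1
        if record.outcome == "accepted":
            accepted[record.action_type] += 1
    return {action: accepted[action] / totals[action] for action in totals}
```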

Outputs

Per-action-type tier assignment — A live map from each action type to its current stage (Crawl, Walk, or Run).
Promotion log — A record you can audit. It shows which action types moved up or down, when, and on what evidence.
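Both outputs can live in one small structure: a map of current tiers plus an append-only log of moves. This is a sketch under assumed names (`TierRegistry`, `register`, `move`) and an assumed log-entry shape; the methodology only requires that the log show what moved, when, and on what evidence.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class TierRegistry:
    tiers: Dict[str, str] = field(default_factory=dict)  # action type -> current tier
    log: List[dict] = field(default_factory=list)        # append-only audit trail

    def register(self, action_type: str) -> None:
        """New action types always start at Crawl."""
        self.tiers[action_type] = "Crawl"

    def move(self, action_type: str, new_tier: str, evidence: str,
             at: Optional[datetime] = None) -> None:
        """Change a tier and record what moved, when, and on what evidence."""
        old_tier = self.tiers[action_type]
        when = at or datetime.now(timezone.utc)
        self.tiers[action_type] = new_tier
        self.log.append({
            "action_type": action_type,
            "from": old_tier,
            "to": new_tier,
            "at": when.isoformat(),
            "evidence": evidence,
        })
```

Promotions and demotions go through the same `move` call, so the log records both directions in the same format.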

Steps (6)

Enumerate action types
List every distinct action the agent can take. 'Reply to ticket', 'issue refund up to $50', 'issue refund above $50', and 'escalate to human' are four separate types, not one.
Publish the metric bar per tier
For each stage, write down the metric and the target an action type must hit to move up. Use acceptance rate, completion rate, or customer outcome. Also set how long each stage must hold before you consider promotion.
Start every action type at Crawl
On day one, no action type starts in Walk or Run, no matter how it was built. In Crawl the agent only suggests. A human accepts, rejects, or edits each suggestion.
Promote one action type at a time
When an action type clears the Crawl-to-Walk bar, move up that one type only. The agent now acts on internal staff for that action. Everything else stays in Crawl.
Watch for regression and auto-demote
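If an action type's metric drops below the bar for its current stage, move it back down a stage. Set the demotion threshold slightly below the promotion bar (hysteresis) so that a metric hovering near the bar does not bounce the action type up and down. A demotion is not a failure. It is the system working as designed.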
Advance to Run with the customer-outcome metric
Moving from Walk to Run needs more than internal acceptance. It needs a real customer outcome, such as a ticket that stays resolved, a refund that is not reversed, or a message that is not flagged. Internal acceptance is not the same as customer success.
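The promote and auto-demote steps can be sketched for the Crawl-Walk gate as a single decision function with a hysteresis band. The 90% bar echoes the example scenario; the 3-point margin and the function name are assumptions for illustration, and the hold-duration check is deliberately left out to keep the sketch short.

```python
def next_tier_on_acceptance(current: str, acceptance: float,
                            bar: float = 0.90, margin: float = 0.03) -> str:
    """Crawl <-> Walk decision with a hysteresis band of width `margin`.

    Promotion needs acceptance at or above `bar`; demotion only happens
    once acceptance falls below `bar - margin`. Readings inside the band
    leave the tier unchanged, which prevents flapping.
    """
    if current == "Crawl" and acceptance >= bar:
        return "Walk"
    if current == "Walk" and acceptance < bar - margin:
        return "Crawl"
    return current
```

With these assumed values, an action type at Walk stays at Walk on a reading of 0.89 and drops to Crawl on 0.81. The Walk-to-Run gate would need its own function on a customer-outcome metric, since internal acceptance does not qualify an action type for Run.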

Principles

Autonomy belongs to action types, not to the agent as a whole.
Every stage publishes both a metric bar and a hold time. The hold time stops one good week from promoting a risky action.
Promote one action at a time. Demote automatically.
Internal acceptance is not customer success. The Walk-to-Run gate uses a different metric from the Crawl-to-Walk gate.
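One way to read the hold-time principle is that the bar must be cleared in every one of the most recent measurement periods, not just the latest. The sketch below assumes weekly readings of a higher-is-better metric; `held_for` and its parameters are hypothetical names.

```python
from typing import Sequence


def held_for(readings: Sequence[float], bar: float, hold_periods: int) -> bool:
    """True if each of the last `hold_periods` readings clears `bar`.

    `readings` is ordered oldest to newest. Too little history counts
    as not held, so a brand-new action type cannot be promoted early.
    """
    if hold_periods <= 0 or len(readings) < hold_periods:
        return False
    return all(reading >= bar for reading in readings[-hold_periods:])
```

For example, `held_for([0.80, 0.95], bar=0.90, hold_periods=2)` is false: one good week after a bad one does not promote the action type.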

Related methodologies (1)

Evaluation-Driven Development ★★ · 6 steps
Judge every prompt change, model swap, search tweak, and new tool against a test you committed to up front, not by feel.

Provenance

Added to catalog: 2026-05-24
Last updated: 2026-05-27
Verification status: verified
