Methodology · Iteration Management · proven · verified

Shadow Canary Bandit Rollout

also known as staged exposure rollout, shadow-to-bandit promotion

Applies to: agent, llm-app, rag-system, coding-agent

Tags: shadow, canary, bandit, rollout, experimentation

Roll out a change to an agent in stages that expose more users as confidence grows. First, run the new version next to the live one and compare them offline, so no user is affected (a shadow run). Next, send a small slice of real traffic to the new version at full risk (a canary). Last, let the system send more traffic to whichever version gets better results from real users. Each stage is a gate. If the numbers get worse at any stage, the rollout stops and the failing cases are saved for study. This works on one change at a time, such as a new prompt, model, or retrieval tweak, rather than on a whole action type.

Methodology process overview

stateDiagram-v2
    [*] --> Shadow: candidate change submitted
    Shadow --> Canary: shadow diff acceptable, no latency/error regression
    Shadow --> Halted: gross output drift or crash
    Canary --> Bandit: clears canary exit criteria on 1-5% slice
    Canary --> Halted: any exit criterion violated
    Bandit --> Promoted: posterior reward sustained above incumbent
    Bandit --> Halted: posterior reward falls below incumbent with confidence
    Halted --> Shadow: fixed build resubmitted
    Halted --> [*]: change abandoned
    Promoted --> [*]
    note right of Halted
        Regression traces captured
        Folded into eval set
    end note
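The gates in the diagram can also be read as a transition table. The following is a minimal Python sketch, not part of the methodology itself: the `Stage` enum, the `TRANSITIONS` table, the `advance` helper, and the extra `ABANDONED` terminal state are illustrative names chosen here.

```python
from enum import Enum


class Stage(Enum):
    SHADOW = "shadow"
    CANARY = "canary"
    BANDIT = "bandit"
    PROMOTED = "promoted"
    HALTED = "halted"
    ABANDONED = "abandoned"  # hypothetical name for "change abandoned"


# Allowed moves, mirroring the arrows in the state diagram.
TRANSITIONS = {
    Stage.SHADOW: {Stage.CANARY, Stage.HALTED},
    Stage.CANARY: {Stage.BANDIT, Stage.HALTED},
    Stage.BANDIT: {Stage.PROMOTED, Stage.HALTED},
    Stage.HALTED: {Stage.SHADOW, Stage.ABANDONED},
    Stage.PROMOTED: set(),
    Stage.ABANDONED: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return target if the diagram allows the move, else raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Encoding the table explicitly makes it impossible to skip a gate by accident: a call such as `advance(Stage.SHADOW, Stage.BANDIT)` raises instead of silently widening exposure.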

Intent. Move an agent change through stages that widen exposure as results hold up. Run it in shadow, then on a small canary slice, then let traffic shift toward the better version. A drop in the numbers stops the rollout on its own.

When to apply. Use this for any live agent or LLM app where changes ship often and a bad change can hurt user experience or drive up cost. It works best when traffic is high, because the small canary slice needs enough volume to catch problems. Don't apply the traffic-shifting stage (a multi-armed bandit) on low-volume systems, where it cannot gather enough signal. There, fall back to a plain shadow run plus a manual switch.
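A quick way to judge whether traffic is "high enough" is to estimate how long the canary slice needs to reach its minimum sample size. The sketch below is simple arithmetic under the assumption of a steady request rate; the function name and the traffic figures in the usage note are hypothetical.

```python
def canary_hours_needed(requests_per_hour: float,
                        canary_fraction: float,
                        min_samples: int) -> float:
    """Hours until the canary slice has seen at least min_samples requests.

    Assumes a steady request rate, which real traffic rarely has.
    """
    if requests_per_hour <= 0 or not 0 < canary_fraction <= 1:
        raise ValueError("need positive traffic and a fraction in (0, 1]")
    return min_samples / (requests_per_hour * canary_fraction)
```

For example, a 2% slice that must collect 4,000 requests needs about 33 hours at 6,000 requests per hour, but roughly 4,000 hours at 50 requests per hour. In the second case the canary cannot finish in a useful time window, which is the situation where a shadow run plus a manual switch is the better fit.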

Example scenario

A team running a high-volume customer-support chatbot has a new prompt that they believe handles refund-policy questions better. They submit it as a candidate change.

Shadow stage: production traffic is mirrored to the new prompt for 48 hours, and outputs are diffed offline against the incumbent. Latency comes in 6 ms lower, there are no crashes, and output drift is modest and concentrated exactly on refund-policy queries (the targeted improvement). Shadow passes.

Canary stage: 2% of live traffic routes to the new prompt under the frozen exit criteria: a minimum of 4,000 conversations, no error-rate regression above 0.5% absolute, no p95 latency regression above 5%, and no increase in guardrail-trip count. After 36 hours the criteria are met on every axis, including a 3-point lift on the refund-question subset. Canary passes.

Bandit stage: allocation is handed to a Bayesian bandit with the new prompt and the incumbent as the two arms; reward is a binary thumbs-up from the user. Within five days the bandit has shifted to 78% allocation to the new prompt because its posterior reward is sustained 4 points above the incumbent.

A week later, an unrelated retrieval change is canary-tested and fails: the canary error rate spiked because the retriever returned empty results on a class of queries the shadow comparison hadn't surfaced. The build is halted, the failing traces are bundled and added to the regression eval set, and the change goes back for rework. The team didn't roll back quietly. The trace bundle is exactly what's needed to write the next round of tests, so the failure produces a permanent improvement to the harness rather than just an embarrassed Slack thread.

Inputs

Production traffic stream — Live requests you can copy to the new version for a shadow run and split for the canary slice.
Reward signal — A result you can measure on each request, such as a user thumbs-up, a completed task, or a downstream metric. The traffic-shifting stage uses this to decide which version to favor.
Regression detector — Automatic checks that flag when a new version's reward or error rate falls below the current one.

Outputs

Promotion log — A record you can audit. It shows which change passed which stage and on what evidence.
Regression trace bundle — The failing cases saved from any stage that did not pass, ready for offline debugging.
Traffic allocation policy — The current rule for how much traffic each version gets. This is the bandit's belief or the A/B split.

Steps (6)

Shadow the new build
Copy production traffic to the new version. Compare its outputs and metrics offline. No user is affected. This catches obvious crashes, slower responses, and large output drift before any user sees the change.
Uses: Shadow Canary
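As a rough illustration of the shadow step, the sketch below serves every request from the incumbent and runs the candidate on a copy, recording crashes and output differences for offline review. All names (`ShadowRecord`, `shadow_run`, `summarize`) are hypothetical. Exact string inequality stands in for "drift" only to keep the example small; real LLM outputs would need a semantic or rubric-based comparison, and latency measurement is left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class ShadowRecord:
    request: str
    incumbent_output: str
    candidate_output: Optional[str]  # None if the candidate crashed
    candidate_error: Optional[str]

    @property
    def drifted(self) -> bool:
        # Placeholder drift test: any textual difference counts.
        return (self.candidate_output is not None
                and self.candidate_output != self.incumbent_output)


def shadow_run(requests: Iterable[str],
               incumbent: Callable[[str], str],
               candidate: Callable[[str], str]) -> List[ShadowRecord]:
    records = []
    for req in requests:
        served = incumbent(req)  # only this output would reach the user
        try:
            shadow_out, err = candidate(req), None
        except Exception as exc:  # a candidate crash must not affect the user
            shadow_out, err = None, repr(exc)
        records.append(ShadowRecord(req, served, shadow_out, err))
    return records


def summarize(records: List[ShadowRecord]) -> dict:
    n = len(records)
    crashes = sum(r.candidate_error is not None for r in records)
    drifted = sum(r.drifted for r in records)
    return {
        "requests": n,
        "crash_rate": crashes / n if n else 0.0,
        "drift_rate": drifted / n if n else 0.0,
    }
```

In production the candidate call would normally run asynchronously so that it cannot slow the served response; the sequential loop here is only for readability.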
Define the canary slice and exit criteria
Pick a small live slice, usually 1 to 5 percent. Write down what 'pass' means before you open it: the minimum sample size, the regression allowed on each metric, and the time window. Freeze these rules first.
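One way to make "freeze these rules first" concrete is to hold the exit criteria in an immutable object and evaluate the canary against it. This is a sketch under assumptions: the class and field names are hypothetical, the guardrail check compares a per-1,000-request rate (raw counts are not comparable across slices of different size), and regressions are only judged once the minimum sample size is reached.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: rules cannot be edited after the canary opens
class ExitCriteria:
    min_samples: int
    max_error_rate_increase: float   # absolute, e.g. 0.005 = 0.5 points
    max_p95_latency_increase: float  # relative, e.g. 0.05 = 5%
    window_hours: float


@dataclass(frozen=True)
class ArmStats:
    samples: int
    error_rate: float
    p95_latency_ms: float
    guardrail_trips_per_1k: float


def evaluate_canary(rules: ExitCriteria, incumbent: ArmStats,
                    canary: ArmStats, hours_open: float) -> str:
    """Return 'pass', 'fail', or 'continue' (keep collecting)."""
    if canary.samples < rules.min_samples:
        # Not enough data yet; fail only if the time window has run out.
        return "fail" if hours_open >= rules.window_hours else "continue"
    regressed = (
        canary.error_rate - incumbent.error_rate > rules.max_error_rate_increase
        or canary.p95_latency_ms
        > incumbent.p95_latency_ms * (1 + rules.max_p95_latency_increase)
        or canary.guardrail_trips_per_1k > incumbent.guardrail_trips_per_1k
    )
    return "fail" if regressed else "pass"
```

A team may also want an early abort for gross regressions before the minimum sample size is reached; that rule is omitted here and would itself need to be written down before the canary opens.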
Open the canary
Send the small slice to the new version at full risk. Watch the error rate, the latency, the reward signal, and how often guardrails trip.
Uses: Shadow Canary Scorer Live Monitoring
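The text does not say how the slice is chosen. A common approach, shown here as an assumption rather than the methodology's prescribed method, is deterministic hash bucketing so that a given user or conversation always lands on the same version. The function names and the `salt` value are hypothetical.

```python
import hashlib


def in_canary(unit_id: str, percent: float, salt: str = "rollout-v1") -> bool:
    """Deterministically place about `percent` percent of ids in the canary."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # percent=2.0 selects buckets 0..199


def route(unit_id: str, percent: float) -> str:
    """Return which version should serve this id."""
    return "candidate" if in_canary(unit_id, percent) else "incumbent"
```

Changing the salt for each rollout reshuffles which ids fall in the slice, so the same users are not always the first to see every new build.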
Collect regression traces on any failure
If the canary misses any of its pass rules, stop the rollout and save the failing cases in a bundle. That bundle becomes a new test for the next round.
Uses: Decision Log Lineage Tracking
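A regression trace bundle can be as simple as one JSON line per failing case, tagged with the change and stage it came from. The sketch below assumes traces arrive as dictionaries with hypothetical keys `failed`, `input`, `output`, and `reason`; the actual trace schema depends on the team's logging setup.

```python
import json
from typing import Iterable, List


def bundle_regressions(traces: Iterable[dict],
                       change_id: str, stage: str) -> List[str]:
    """Return JSONL lines for failed traces, tagged with their origin."""
    lines = []
    for trace in traces:
        if not trace.get("failed"):
            continue
        case = {
            "change_id": change_id,
            "stage": stage,
            "input": trace["input"],
            "observed_output": trace.get("output"),
            "failure_reason": trace.get("reason", "unspecified"),
        }
        lines.append(json.dumps(case, sort_keys=True))
    return lines
```

Writing these lines to a file in the eval set's directory turns each failure into a regression case that the next candidate must pass.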
Promote to bandit or A/B
When the canary clears its bar, hand traffic control either to a learning method that shifts traffic toward the better version (a Bayesian bandit) or to a simple fixed A/B split. The bandit gives the new version more traffic as its results pull ahead; a fixed split holds the allocation constant until someone changes it.
Uses: Bayesian Bandit Experimentation
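The text names a Bayesian bandit without fixing an algorithm. One standard choice for a binary reward such as thumbs-up is Thompson sampling with a Beta posterior per arm, sketched below. The class name, the uniform Beta(1, 1) prior, and the arm labels are assumptions made for illustration.

```python
import random
from typing import Iterable, Optional


class BetaBernoulliBandit:
    """Thompson sampling over arms with a binary (thumbs-up) reward."""

    def __init__(self, arms: Iterable[str], seed: Optional[int] = None):
        self.rng = random.Random(seed)
        # Beta(1, 1) is a uniform prior over each arm's thumbs-up rate.
        self.alpha = {arm: 1.0 for arm in arms}
        self.beta = {arm: 1.0 for arm in self.alpha}

    def choose(self) -> str:
        # Draw one plausible reward rate per arm and serve the best draw.
        draws = {arm: self.rng.betavariate(self.alpha[arm], self.beta[arm])
                 for arm in self.alpha}
        return max(draws, key=draws.get)

    def update(self, arm: str, thumbs_up: bool) -> None:
        if thumbs_up:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

    def posterior_mean(self, arm: str) -> float:
        return self.alpha[arm] / (self.alpha[arm] + self.beta[arm])
```

Each request calls `choose()`, serves that version, and later calls `update()` with the user's feedback. Because the weaker arm is drawn less often as evidence accumulates, allocation drifts toward the better version without a fixed schedule.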
Watch the results and auto-rollback on regression
Keep watching the bandit's view of each version. If the new version's reward clearly drops below the current one, the bandit moves traffic away from it on its own. Add a firm rollback threshold to fully retire the build.
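A firm rollback threshold can be expressed as a probability: retire the build when the posterior probability that the candidate is worse than the incumbent crosses a preset bar. The sketch below estimates that probability by Monte Carlo over Beta posteriors with a uniform prior. The function names, the 0.95 threshold, and the draw count are hypothetical choices, not values from the methodology.

```python
import random


def prob_candidate_worse(cand_up: int, cand_down: int,
                         inc_up: int, inc_down: int,
                         draws: int = 20_000, seed: int = 0) -> float:
    """Estimate P(candidate reward rate < incumbent reward rate)."""
    rng = random.Random(seed)
    worse = 0
    for _ in range(draws):
        cand = rng.betavariate(1 + cand_up, 1 + cand_down)
        inc = rng.betavariate(1 + inc_up, 1 + inc_down)
        worse += cand < inc
    return worse / draws


def should_retire(cand_up: int, cand_down: int,
                  inc_up: int, inc_down: int,
                  threshold: float = 0.95) -> bool:
    """True when the candidate is worse with at least `threshold` probability."""
    return prob_candidate_worse(cand_up, cand_down, inc_up, inc_down) >= threshold
```

With 10 thumbs-up out of 100 for the candidate against 50 out of 100 for the incumbent, the estimate is close to 1 and the build is retired. With equal counts the estimate sits near 0.5 and the rollout continues.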

Principles

Exposure widens only on evidence. Every stage gates the next.
A regression is a useful output. Save it as traces and fold it into the test set. Do not just roll back.
A learning method (bandit) replaces a fixed A/B split once traffic is high enough. Below that, A/B or a canary with a manual switch is the honest choice.
Every stage has a pass rule. Freeze it before the stage opens.
