Shadow Canary Bandit Rollout
also known as staged exposure rollout, shadow-to-bandit promotion
Roll out a change to an agent in stages that expose more users as confidence grows. First, run the new version next to the live one and compare them offline, so no user is affected (a shadow run). Next, send a small slice of real traffic to the new version, so a limited set of users sees its real responses (a canary). Last, let the system shift more traffic to whichever version earns better results from real users. Each stage is a gate. If the numbers get worse at any stage, the rollout stops and the failing cases are saved for study. This works on one change at a time, such as a new prompt, model, or retrieval tweak, rather than on a whole action type.
Methodology process overview
Intent. Move an agent change through stages that widen exposure as results hold up. Run it in shadow, then on a small canary slice, then let traffic shift toward the better version. A drop in the numbers stops the rollout on its own.
When to apply. Use this for any live agent or LLM app where changes ship often and a bad change can hurt the user experience or drive up cost. It works best when traffic is high, because the small canary slice needs enough volume to catch problems. Don't apply the traffic-shifting stage on low-volume systems, where the learning method (a multi-armed bandit) cannot gather enough signal. There, fall back to a plain shadow run plus a manual switch.
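As a rough way to judge whether traffic is high enough, the arithmetic below estimates how many days a canary slice needs to reach a minimum sample size. It is a minimal sketch: the function name `days_to_reach_sample` and the traffic figures in the usage note are hypothetical, and it assumes traffic is spread evenly across days.

```python
import math


def days_to_reach_sample(daily_requests: int, slice_fraction: float, min_samples: int) -> int:
    """Days a canary slice needs to collect min_samples, assuming even daily traffic."""
    canary_per_day = daily_requests * slice_fraction
    if canary_per_day <= 0:
        raise ValueError("the canary slice must receive some traffic")
    return math.ceil(min_samples / canary_per_day)
```

For instance, under these assumptions a system with 5,000 requests a day and a 2 percent slice would need 20 days to collect 2,000 canary samples, which suggests the manual-switch fallback is the more practical choice there.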
Inputs
- Production traffic stream — Live requests you can copy to the new version for a shadow run and split for the canary slice.
- Reward signal — A result you can measure on each request, such as a user thumbs-up, a completed task, or a downstream metric. The traffic-shifting stage uses this to decide which version to favor.
- Regression detector — Automatic checks that flag when a new version's reward or error rate falls below the current one.
Outputs
- Promotion log — A record you can audit. It shows which change passed which stage and on what evidence.
- Regression trace bundle — The failing cases saved from any stage that did not pass, ready for offline debugging.
- Traffic allocation policy — The current rule for how much traffic each version gets. This is the bandit's belief or the A/B split.
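One possible shape for a promotion log entry is sketched below. The field names (`change_id`, `stage`, `passed`, `evidence`) and the JSON-lines layout are assumptions for illustration, not a format the pattern prescribes.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PromotionEntry:
    change_id: str              # hypothetical identifier for the change under rollout
    stage: str                  # "shadow", "canary", or "bandit"
    passed: bool                # whether the stage gate was cleared
    evidence: Dict[str, float]  # the metrics the decision was based on


def append_entry(log_lines: List[str], entry: PromotionEntry) -> None:
    """Append one entry as a JSON line so the log stays append-only and auditable."""
    log_lines.append(json.dumps(asdict(entry), sort_keys=True))
```

Each gate decision appends one line, so the log can be replayed later to see which evidence justified each promotion.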
Steps (6)
Shadow the new build
Copy production traffic to the new version. Compare its outputs and metrics offline. No user is affected. This catches obvious crashes, slower responses, and large output drift before any user sees the change.
Uses: Shadow Canary
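The shadow step can be sketched as follows. This is a minimal illustration that assumes agents are plain callables from a request string to an output string and uses exact string equality as the drift check. Real agent outputs usually need a softer comparison, such as a scorer or a judge. The names `run_shadow` and `drift_rate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ShadowResult:
    request: str
    live_output: str
    shadow_output: str
    matches: bool


def run_shadow(
    requests: Iterable[str],
    live_agent: Callable[[str], str],
    candidate_agent: Callable[[str], str],
) -> List[ShadowResult]:
    """Answer every request with the live agent; record the candidate's output on the side."""
    results = []
    for request in requests:
        live_output = live_agent(request)  # this is what the user would receive
        try:
            shadow_output = candidate_agent(request)  # recorded only, never returned to the user
        except Exception as exc:  # a candidate crash must not affect the live path
            shadow_output = f"<error: {exc}>"
        results.append(
            ShadowResult(request, live_output, shadow_output, live_output == shadow_output)
        )
    return results


def drift_rate(results: List[ShadowResult]) -> float:
    """Fraction of requests where the candidate's output differs from the live output."""
    if not results:
        return 0.0
    return sum(not r.matches for r in results) / len(results)
```

Catching the candidate's exceptions inside the loop keeps a crashing build from ever touching the live response, which is the property that makes the shadow stage safe.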
Define the canary slice and exit criteria
Pick a small live slice, usually 1 to 5 percent. Write down what 'pass' means before you open it: the minimum sample size, the regression allowed on each metric, and the time window. Freeze these rules first.
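Freezing the rules can be made literal in code. The sketch below uses an immutable dataclass so the criteria cannot be edited once the canary is open. The class and field names are hypothetical, and the metrics are limited to error rate and reward for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryCriteria:
    slice_fraction: float           # share of live traffic, e.g. 0.02 for 2 percent
    min_samples: int                # minimum requests before the gate may be judged
    max_error_rate_increase: float  # absolute increase allowed over the current version
    max_reward_drop: float          # absolute drop in mean reward allowed
    window_hours: int               # how long the canary stays open

    def __post_init__(self) -> None:
        if not 0.0 < self.slice_fraction <= 1.0:
            raise ValueError("slice_fraction must be in (0, 1]")
        if self.min_samples <= 0 or self.window_hours <= 0:
            raise ValueError("min_samples and window_hours must be positive")
```

Any attempt to assign to a field after construction raises an error, so loosening a rule requires creating a new, visibly different criteria object.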
Open the canary
Send the small slice to the new version. These users get its real responses, so the exposure is real but limited. Watch the error rate, the latency, the reward signal, and how often guardrails trip.
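One common way to pick the slice is to hash a stable identifier, so the same user always lands on the same version. The sketch below assumes a string `user_id` is available on every request. The function name `in_canary` and the `salt` value are hypothetical.

```python
import hashlib


def in_canary(user_id: str, slice_fraction: float, salt: str = "rollout-1") -> bool:
    """Deterministically assign a user to the canary slice."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < slice_fraction
```

Changing the salt for each rollout reshuffles which users fall into the slice, so the same people are not always the first to see new builds.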
Collect regression traces on any failure
If the canary misses any of its pass rules, stop the rollout and save the failing cases in a bundle. That bundle becomes a new test for the next round.
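A minimal sketch of the gate check and trace capture, assuming each canary record is a dict with `request`, `output`, and a boolean `error` field. It checks only the error rate, and all names are hypothetical.

```python
import json
from typing import Dict, List


def evaluate_canary(
    records: List[Dict],
    baseline_error_rate: float,
    max_error_rate_increase: float,
    min_samples: int,
) -> Dict:
    """Judge the canary against its frozen rule and keep the failing cases on a fail."""
    samples = len(records)
    failures = [r for r in records if r["error"]]
    error_rate = len(failures) / samples if samples else 0.0
    if samples < min_samples:
        verdict = "insufficient_data"  # not enough evidence to pass or fail
    elif error_rate > baseline_error_rate + max_error_rate_increase:
        verdict = "fail"
    else:
        verdict = "pass"
    return {
        "verdict": verdict,
        "samples": samples,
        "error_rate": error_rate,
        "regression_traces": failures if verdict == "fail" else [],
    }


def bundle_to_json(result: Dict) -> str:
    """Serialize the result so the traces can be stored and replayed as a test."""
    return json.dumps(result, indent=2, sort_keys=True)
```

Returning the failing records alongside the verdict means a rollback cannot happen without the traces being produced at the same time.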
Promote to bandit or A/B
When the canary clears its bar, hand traffic control to either a learning method that shifts traffic toward the better version (a Bayesian bandit) or a simple fixed A/B split. The bandit gives the new version more traffic as its results pull ahead; a fixed split holds its ratio until you change it.
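The bandit option can be illustrated with Thompson sampling over a Beta-Bernoulli model. This is one common Bayesian bandit, chosen here as an assumption, and it requires the reward to be binary, such as thumbs-up or task completed. The class name `BetaBernoulliBandit` is hypothetical.

```python
import random
from typing import Dict, Iterable, List, Optional


class BetaBernoulliBandit:
    """Thompson sampling over versions with a binary (success/failure) reward."""

    def __init__(self, arms: Iterable[str], seed: Optional[int] = None) -> None:
        # Beta(1, 1) prior per arm, stored as [successes + 1, failures + 1].
        self.posterior: Dict[str, List[float]] = {arm: [1.0, 1.0] for arm in arms}
        self.rng = random.Random(seed)

    def choose(self) -> str:
        """Sample a plausible success rate per arm and route to the highest sample."""
        samples = {
            arm: self.rng.betavariate(alpha, beta)
            for arm, (alpha, beta) in self.posterior.items()
        }
        return max(samples, key=samples.get)

    def update(self, arm: str, success: bool) -> None:
        """Fold one observed reward into the chosen arm's posterior."""
        self.posterior[arm][0 if success else 1] += 1.0
```

Each request calls `choose()` to pick a version and `update()` once the reward arrives. As one version's posterior pulls ahead, its samples win more often and it receives more traffic, with no fixed split to retune.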
Watch the results and auto-rollback on regression
Keep watching the bandit's view of each version. If the new version's reward clearly drops below the current one, the bandit moves traffic away from it on its own. The bandit only starves a bad build of traffic, so also set a firm rollback threshold that fully retires it.
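A firm rollback threshold can be expressed as a probability. The sketch below estimates, by Monte Carlo sampling from Beta posteriors, how likely it is that the new version's success rate is below the current one, and retires the build once that probability crosses a threshold. The binary reward, the uniform Beta(1, 1) prior, the 0.99 default, and the function names are all assumptions.

```python
import random


def prob_candidate_worse(
    cand_success: int,
    cand_failure: int,
    cur_success: int,
    cur_failure: int,
    draws: int = 5000,
    seed: int = 0,
) -> float:
    """Estimate P(candidate success rate < current success rate) from observed counts."""
    rng = random.Random(seed)
    worse = 0
    for _ in range(draws):
        cand = rng.betavariate(1 + cand_success, 1 + cand_failure)
        cur = rng.betavariate(1 + cur_success, 1 + cur_failure)
        worse += cand < cur
    return worse / draws


def should_rollback(prob_worse: float, threshold: float = 0.99) -> bool:
    """Retire the build once the evidence that it is worse passes the threshold."""
    return prob_worse >= threshold
```

A high threshold keeps noise from triggering a rollback on a handful of bad requests, while still guaranteeing that a clearly worse build is removed rather than left with a trickle of traffic.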
Principles
- Exposure widens only on evidence. Every stage gates the next.
- A regression is a useful output. Save it as traces and fold it into the test set. Do not just roll back.
- A learning method (bandit) replaces a fixed A/B split once traffic is high enough. Below that, A/B or a canary with a manual switch is the honest choice.
- Every stage has a pass rule. Freeze it before the stage opens.
Known failure modes (3)
- Demo-to-Production Cliff
Skipping shadow and canary because the new build 'looked great in dev'. The cliff is exactly what shadow is for.
- Perma-Beta
Stage gates set so high that no change ever advances past canary, freezing the system in indefinite trial mode.
- Errors Swept Under the Rug
Failing canaries quietly rolled back without capturing the regression trace, so the same failure ships again next quarter.
Related patterns (5)
- ★★Shadow Canary
Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.
- ★Bayesian Bandit Experimentation
Replace fixed-split A/B tests between agent variants with a bandit that dynamically reallocates traffic toward better-performing variants based on observed reward, bounding regret from bad variants.
- ★Scorer Live Monitoring
Score agent outputs asynchronously in production with non-blocking scorers that observe, alert, and log but do not regenerate the output.
- ★★Decision Log
Persist the agent's reasoning trace alongside its actions so post-hoc review can explain why.
- ★★Lineage Tracking
Track which prompt version, model version, and data sources produced each agent output.
Related compositions (2)
- recipe · abstract shape · Production LLM Platform
Stand up a production LLM/RAG system whose data pipeline, model pipeline, and inference path scale and deploy independently.
- recipe · abstract shape · Eval & Observability
How you keep an agent honest in production: harness, judge, decision log, provenance, shadow rollouts.
Related methodologies (2)
- Crawl-Walk-Run Automation Gating★★
Separate what an agent can do from what it is allowed to do on its own. A system that could plausibly act gets to act only after the data earns it, one action type at a time.
- Evaluation-Driven Development★★
Judge every prompt change, model swap, search tweak, and new tool against a test you committed to up front, not by feel.
Sources (2)
Building Applications with AI Agents
Ch 10 'Monitoring in Production' / Ch 11 'Improvement Loops' “Monitoring patterns including 'Shadow Mode' and 'Canary Deployments' ... 'Regression Trace Collection' ... Experimentation approaches: 'Shadow Deployments,' 'A/B Testing,' and 'Bayesian Bandits'”
LLM Canary Prompting in Production: Shadow Tests, Drift Alarms, and Safe Rollouts
“Treat prompt changes like deployments, and wrap them in the same safety rails you'd give any other production rollout.”
Provenance
- Added to catalog:
- Last updated:
- Verification status: verified