Shadow Canary

Also known as: Shadow Agent, Canary Deployment

Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.

Context

A team wants to roll out a new model, a tweaked prompt, or reworked tool wiring to an agent already serving real users. They have an existing version (the champion) that they trust on live traffic and a candidate version (the challenger) they want to validate before promoting. The traffic distribution in production includes long-tail queries that no pre-release evaluation set fully captures.

Problem

Pre-release evaluations cover the distributions the team thought to put in the test set, not the surprising ones that show up in real usage. Releasing the challenger directly to a fraction of users exposes those users to whatever regressions it has. The team is forced to choose between launching blind and hoping nothing breaks, and trying to build a separate evaluation set comprehensive enough to stand in for live traffic, which in practice never quite matches live behaviour.

Forces

Shadow runs cost money for output never shown.
Comparison logic for free-form outputs is non-trivial.
Shadow latency must not affect the user-visible path.

Example

A team wants to upgrade the underlying model of an agent in production, but pre-release evals miss real-traffic regressions. They route ten percent of real traffic through both the champion (current) and the challenger (candidate); only the champion's reply reaches the user. Over a week, a judge model compares the two on agreed metrics. They catch a regression on a niche legal-style query that no eval covered, fix it, then promote the challenger.

Diagram

flowchart TD
    U[User request] --> Split[Traffic split]
    Split --> Champ[Champion agent]
    Split --> Chall[Challenger agent]
    Champ --> Resp[User response]
    Chall --> Log[Shadow log]
    Champ --> Diff[Diff metrics:<br/>judge / exact-match / latency / cost]
    Log --> Diff
    Diff --> Gate{Lift?}
    Gate -- yes --> Promote[Promote challenger]
    Gate -- regression --> Revert[Revert]

Solution

Therefore:

Route a fraction of real traffic through both champion and challenger. Champion's output reaches the user. Challenger's output is logged. Diff the outputs on agreed metrics (judge model, exact match on tool calls, latency, cost). Promote on lift; revert on regression.
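
The promote-or-revert decision could be sketched as a small gate over the logged pairs. This is an illustration under assumptions rather than a prescribed rule: the `PairedResult` fields, the `decide` function, and every threshold (minimum sample count, tool-call match rate, latency and cost ratios) are hypothetical, and judge scores are assumed to be numeric with higher meaning better. The source says only "promote on lift; revert on regression"; treating a tie as "revert" and adding a "continue" outcome for small samples are choices made here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PairedResult:
    """One shadowed request: both agents' metrics for the same input."""
    champion_score: float      # judge score for the champion reply
    challenger_score: float    # judge score for the challenger reply
    tool_calls_match: bool     # exact match on the sequence of tool calls
    champion_latency_s: float
    challenger_latency_s: float
    champion_cost: float
    challenger_cost: float


def decide(results, min_samples=200, min_lift=0.0, min_tool_match=0.95,
           max_latency_ratio=1.2, max_cost_ratio=1.2):
    """Return "promote", "revert", or "continue" (keep shadowing)."""
    if len(results) < min_samples:
        return "continue"  # too little evidence to decide either way

    # Mean paired difference in judge score: positive means the challenger is better.
    lift = mean(r.challenger_score - r.champion_score for r in results)
    tool_match = mean(1.0 if r.tool_calls_match else 0.0 for r in results)

    # Ratios assume the champion's mean latency and cost are positive.
    latency_ratio = (mean(r.challenger_latency_s for r in results)
                     / mean(r.champion_latency_s for r in results))
    cost_ratio = (mean(r.challenger_cost for r in results)
                  / mean(r.champion_cost for r in results))

    ok = (lift > min_lift
          and tool_match >= min_tool_match
          and latency_ratio <= max_latency_ratio
          and cost_ratio <= max_cost_ratio)
    return "promote" if ok else "revert"
```

Averages can hide a regression confined to a rare query type, so a real gate would probably also inspect the worst-scoring pairs individually before promoting.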

What this pattern forbids. Challenger output must never reach the user during the shadow phase; it is only logged.

The smaller patterns that complete this one:

uses: LLM-as-Judge ★★ - Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.

And the patterns that stand alongside it, or against it:

complements: Eval Harness ★★ - Run a held-out dataset against agent versions to detect regressions and measure improvement.
alternative-to: Perma-Beta ✕ - Anti-pattern: ship the agent in 'beta' indefinitely so that quality regressions are someone else's problem.
complements: Eval as Contract ★★ - Treat the eval suite as the contract the agent must satisfy; releases ship only if evals pass.
complements: Prompt Versioning ★★ - Treat prompts as immutable, hashed, semver'd artefacts in a registry; deploy and roll back like code.
complements: Scorer Live Monitoring ★ - Score agent outputs asynchronously in production with non-blocking scorers that observe, alert, and log but do not regenerate the output.
alternative-to: Demo-to-Production Cliff ✕ - Anti-pattern: ship a demo-validated agent straight into production without a frozen eval, cost ceiling, loop-detector, or named oncall, then act surprised when accuracy drops and cost runs away.
complements: Dual Evaluation (Offline + Online) ★ - Run two parallel evaluation tracks (offline benchmark gates before deploy AND online production-traffic monitoring after) so drift is caught even when pre-deploy benchmarks pass.
complements: Demo-Production Cliff (Multi-Agent) ✕ - Anti-pattern: multi-agent pilot benchmarks at 95% accuracy / 2s latency on a curated demo set, then degrades to ~80% / 40s under realistic 10k-RPD load.
complements: Context Gap (Security) ✕ - Agents faithfully follow explicit security rules but miss the broader implications: they log access correctly without flagging the unusual pattern a human expert would catch immediately.
alternative-to: Bayesian Bandit Experimentation ★ - Replace fixed-split A/B tests between agent variants with a bandit that dynamically reallocates traffic toward better-performing variants based on observed reward, bounding regret from bad variants.
complements: [crawl-walk-run-automation-gating]
complements: [evaluation-driven-development]
complements: Sampled Prompt Trace Eval ★ - Capture full prompt/response/metadata traces from production into a monitoring dataset, but only run LLM-judge evaluation on a random sample so monitoring cost stays bounded as traffic grows.
complements: Progressive Delegation ★ - Stage the human-to-agent handoff over time: the agent starts producing drafts a human always reviews; its autonomy expands action-by-action as measured trust accrues.
complements: Trust and Reputation Routing ★ - Maintain a per-agent reputation score updated from outcome quality and peer feedback, and route new tasks preferentially to high-reputation agents.
alternative-to: Prompt Variant Evaluation ★★ - Author multiple variants of the same prompt node, run them as a batch against a shared dataset, and let an automated evaluation flow score them so the winning variant is selected by measurement.
