Shadow Canary
Run a candidate agent version (the challenger) in shadow alongside the current production version (the champion), comparing outputs without affecting users.
Problem
Pre-release evaluations cover the distributions the team thought to put in the test set, not the surprising ones that show up in real usage. Releasing the challenger directly to a fraction of users exposes those users to whatever regressions it has. The team is forced to choose between launching blind and hoping nothing breaks, and trying to build an offline evaluation set comprehensive enough to stand in for live behaviour, which it never fully does.
Solution
Route a fraction of real traffic through both champion and challenger. Champion's output reaches the user. Challenger's output is logged. Diff the outputs on agreed metrics (judge model, exact match on tool calls, latency, cost). Promote on lift; revert on regression.
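The routing step can be sketched as below. This is a minimal illustration, not a reference implementation: `ShadowRouter`, `ShadowRecord`, and `shadow_fraction` are hypothetical names, agents are assumed to be plain callables from a request string to an output string, and the challenger is called synchronously for brevity. A real deployment would more likely run it off the request path so it cannot add user-facing latency.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Assumption: an agent is any callable mapping a request string to an output string.
Agent = Callable[[str], str]


@dataclass
class ShadowRecord:
    """One paired observation, logged for later diffing."""
    request: str
    champion_output: str
    challenger_output: Optional[str]
    challenger_error: Optional[str]
    champion_latency_s: float
    challenger_latency_s: Optional[float]


def _timed(agent: Agent, request: str) -> Tuple[str, float]:
    """Call the agent and return (output, wall-clock seconds)."""
    start = time.perf_counter()
    output = agent(request)
    return output, time.perf_counter() - start


class ShadowRouter:
    def __init__(self, champion: Agent, challenger: Agent,
                 shadow_fraction: float,
                 rng: Optional[random.Random] = None) -> None:
        if not 0.0 <= shadow_fraction <= 1.0:
            raise ValueError("shadow_fraction must be in [0, 1]")
        self.champion = champion
        self.challenger = challenger
        self.shadow_fraction = shadow_fraction
        self.rng = rng or random.Random()
        self.log: List[ShadowRecord] = []

    def handle(self, request: str) -> str:
        # The champion always serves the user.
        champion_output, champion_latency = _timed(self.champion, request)

        # Only a sampled fraction of requests is mirrored to the challenger.
        if self.rng.random() < self.shadow_fraction:
            try:
                out, latency = _timed(self.challenger, request)
                error = None
            except Exception as exc:
                # A challenger failure is recorded, never surfaced to the user.
                out, latency, error = None, None, repr(exc)
            self.log.append(ShadowRecord(
                request, champion_output, out, error,
                champion_latency, latency))

        # The challenger's output is never returned.
        return champion_output
```

The challenger call is wrapped in a broad `except` on purpose: a crash in the shadow path is itself a data point for the comparison, and it should not be able to change what the user receives.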
When to use
- Agent changes are non-deterministic and CI cannot capture field behaviour.
- Real traffic can be replayed through a challenger without affecting users.
- A diff metric (judge model, exact match, latency) can be defined.
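A diff metric and a promotion rule might look like the following sketch. Everything here is an assumption made for illustration: the `PairedSample` shape, the idea that a judge model has already produced a score per output, and the thresholds in `decide`. The rule compares simple means with no significance test, so it should be read as a placeholder for whatever statistical criterion the team agrees on.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Assumption: a tool call is (tool name, canonicalised JSON argument string).
ToolCall = Tuple[str, str]


@dataclass
class PairedSample:
    champion_calls: Sequence[ToolCall]
    challenger_calls: Sequence[ToolCall]
    champion_score: float      # hypothetical judge-model score, higher is better
    challenger_score: float
    champion_latency_s: float
    challenger_latency_s: float


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def tool_call_match_rate(samples: Sequence[PairedSample]) -> float:
    """Fraction of samples whose tool-call sequences match exactly, in order."""
    if not samples:
        return 0.0
    matches = sum(
        1 for s in samples
        if list(s.champion_calls) == list(s.challenger_calls)
    )
    return matches / len(samples)


def decide(samples: Sequence[PairedSample],
           min_samples: int = 100,
           min_lift: float = 0.02,
           max_regression: float = 0.02,
           max_latency_ratio: float = 1.2) -> str:
    """Return 'promote', 'revert', or 'continue' (keep collecting data)."""
    if len(samples) < min_samples:
        return "continue"

    lift = (_mean([s.challenger_score for s in samples])
            - _mean([s.champion_score for s in samples]))
    champion_latency = _mean([s.champion_latency_s for s in samples])
    challenger_latency = _mean([s.challenger_latency_s for s in samples])

    # Treat a clear quality drop or a latency blow-up as a regression.
    too_slow = (champion_latency > 0
                and challenger_latency / champion_latency > max_latency_ratio)
    if lift < -max_regression or too_slow:
        return "revert"
    if lift >= min_lift:
        return "promote"
    return "continue"
```

Exact match on tool calls is deliberately strict: it flags any change in tool choice, arguments, or ordering, which makes it a useful tripwire alongside the softer judge score rather than a quality measure on its own.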
Related
- Eval Harness
- LLM-as-Judge
- Perma-Beta
- Eval as Contract
- Prompt Versioning
- Scorer Live Monitoring
- Demo-to-Production Cliff
- Dual Evaluation (Offline + Online)
- Demo-Production Cliff (Multi-Agent)
- Context Gap (Security)
- Bayesian Bandit Experimentation
- Sampled Prompt Trace Eval
- Progressive Delegation
- Trust and Reputation Routing
- Prompt Variant Evaluation