Governance & Observability

Shadow Canary

Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.

Problem

Pre-release evaluations cover the distributions the team thought to put in the test set, not the surprising ones that show up in real usage. Releasing the challenger directly to a fraction of users exposes those users to whatever regressions it has. The team must either launch blind and hope nothing breaks, or keep building a separate evaluation set that, however comprehensive, never actually matches live behaviour.

Solution

Route a fraction of real traffic through both the champion and the challenger. Only the champion's output reaches the user; the challenger's output is logged and never served. Diff the two outputs on agreed metrics (judge model, exact match on tool calls, latency, cost). Promote on lift; revert on regression.
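
A minimal sketch of the routing step, assuming both agents are plain callables that map a request string to a response string. The names `ShadowRouter`, `ShadowRecord`, and `shadow_fraction` are hypothetical and used only for illustration. The challenger runs inline here to keep the example short; a real deployment would more likely run it off the request path so it adds no user-facing latency, and would stub out any side-effecting tools the challenger can call.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Agent = Callable[[str], str]  # assumption: an agent maps a request to a response


@dataclass
class ShadowRecord:
    """One logged comparison between champion and challenger."""
    request: str
    champion_output: str
    challenger_output: Optional[str]
    champion_latency_s: float
    challenger_latency_s: Optional[float]
    challenger_error: Optional[str]


def timed(agent: Agent, request: str) -> Tuple[str, float]:
    """Call the agent and return (output, elapsed seconds)."""
    start = time.perf_counter()
    output = agent(request)
    return output, time.perf_counter() - start


class ShadowRouter:
    def __init__(self, champion: Agent, challenger: Agent,
                 shadow_fraction: float, log: List[ShadowRecord],
                 rng: Optional[random.Random] = None) -> None:
        self.champion = champion
        self.challenger = challenger
        self.shadow_fraction = shadow_fraction
        self.log = log
        self.rng = rng or random.Random()

    def handle(self, request: str) -> str:
        # The champion always serves the user.
        champion_output, champion_latency = timed(self.champion, request)

        # A sampled fraction of requests is also sent to the challenger.
        if self.rng.random() < self.shadow_fraction:
            try:
                challenger_output, challenger_latency = timed(self.challenger, request)
                error = None
            except Exception as exc:
                # A challenger failure is recorded, never surfaced to the user.
                challenger_output, challenger_latency, error = None, None, repr(exc)
            self.log.append(ShadowRecord(
                request, champion_output, challenger_output,
                champion_latency, challenger_latency, error,
            ))

        # Only the champion's output is returned.
        return champion_output
```

The property that matters is that `handle` returns the champion's output on every path, including when the challenger raises.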

When to use

  • Agent changes are non-deterministic and CI cannot capture field behaviour.
  • Real traffic can be replayed through a challenger without affecting users.
  • A diff metric (judge model, exact match, latency) can be defined.
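
To make the last condition concrete, here is one hedged sketch of a diff metric and a promotion check built from two of the signals named above: exact match on tool calls and latency. The `Pair` type, the `(tool name, serialized arguments)` representation, and the thresholds are illustrative assumptions, not recommended values; a judge-model score or cost delta would slot in the same way.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Assumption: a tool call is represented as (tool name, serialized arguments).
ToolCall = Tuple[str, str]


@dataclass
class Pair:
    """Champion and challenger results for the same request."""
    champion_calls: Sequence[ToolCall]
    challenger_calls: Sequence[ToolCall]
    champion_latency_s: float
    challenger_latency_s: float


def tool_call_match_rate(pairs: List[Pair]) -> float:
    """Fraction of requests where both agents made identical tool calls in order."""
    if not pairs:
        raise ValueError("no shadow pairs to compare")
    matches = sum(
        1 for p in pairs if list(p.champion_calls) == list(p.challenger_calls)
    )
    return matches / len(pairs)


def mean_latency_delta_s(pairs: List[Pair]) -> float:
    """Mean of (challenger latency - champion latency); positive means slower."""
    if not pairs:
        raise ValueError("no shadow pairs to compare")
    total = sum(p.challenger_latency_s - p.champion_latency_s for p in pairs)
    return total / len(pairs)


def decide(pairs: List[Pair],
           min_match_rate: float = 0.95,
           max_latency_delta_s: float = 0.2) -> str:
    """Return "promote" only if both hypothetical thresholds are met."""
    ok = (tool_call_match_rate(pairs) >= min_match_rate
          and mean_latency_delta_s(pairs) <= max_latency_delta_s)
    return "promote" if ok else "reject"
```

Exact match treats any intended change in tool use as a mismatch, so a challenger that is meant to call tools differently needs a metric that scores outcomes rather than agreement.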

Related