Dual Evaluation (Offline + Online)
Run two parallel evaluation tracks, offline benchmark gates before deploy and online production-traffic monitoring after, so that drift is caught even when pre-deploy benchmarks pass.
Problem
Offline-only eval cannot catch drift between benchmark traffic and production traffic. Online-only eval cannot prevent bad deploys. Either alone misses failure modes the other catches.
Solution
Offline track: a curated benchmark suite runs pre-deploy and gates rollout on its score. Online track: production traffic is sampled and labeled after the fact (by human review or LLM-as-judge), feeding rolling metrics with alerting. An offline pass followed by an online regression is itself a signal: it indicates a gap between the benchmark and production traffic. Pair with eval-harness, artifact-evaluation, shadow-canary, scorer-live-monitoring.
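The two tracks and the disagreement signal can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `offline_gate`, `OnlineMonitor`, and `benchmark_gap`, along with every threshold and window size, are assumptions made for the example, and scores are assumed to be floats in [0, 1] produced by whatever labeling step (human review or LLM-as-judge) is in place.

```python
from collections import deque
from dataclasses import dataclass, field
from statistics import mean


def offline_gate(benchmark_scores, threshold=0.85):
    """Offline track: gate rollout on the mean benchmark score.

    Returns (passed, score). The threshold is an illustrative assumption.
    """
    score = mean(benchmark_scores)
    return score >= threshold, score


@dataclass
class OnlineMonitor:
    """Online track: rolling metric over delayed labels of sampled traffic."""

    window: int = 100          # number of most recent labeled samples kept
    alert_below: float = 0.80  # rolling score under this counts as a regression
    min_samples: int = 20      # do not alert on too little evidence
    labels: deque = field(init=False)

    def __post_init__(self):
        # A bounded deque drops the oldest label once the window is full.
        self.labels = deque(maxlen=self.window)

    def record(self, score):
        """Store one delayed label (e.g. a reviewer or judge score)."""
        self.labels.append(score)

    def rolling_score(self):
        return mean(self.labels) if self.labels else None

    def regressed(self):
        if len(self.labels) < self.min_samples:
            return False
        return self.rolling_score() < self.alert_below


def benchmark_gap(offline_passed, monitor):
    """Disagreement signal: the benchmark passed but production regressed."""
    return offline_passed and monitor.regressed()


if __name__ == "__main__":
    passed, score = offline_gate([0.90, 0.90, 0.80])
    monitor = OnlineMonitor(window=5, alert_below=0.80, min_samples=3)
    for label in (0.6, 0.7, 0.5):  # hypothetical delayed labels
        monitor.record(label)
    print("offline passed:", passed)
    print("online regressed:", monitor.regressed())
    print("benchmark-vs-production gap:", benchmark_gap(passed, monitor))
```

Keeping `benchmark_gap` separate from the online alert is a deliberate choice in this sketch: an online regression calls for a rollback or investigation, whereas a regression that follows an offline pass additionally suggests the benchmark suite no longer reflects production traffic and needs refreshing.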
When to use
- Production deployment of consequential agents.
- Both pre-deploy gating and post-deploy monitoring matter.
- The team has the engineering capacity to build and maintain two eval infrastructures.