Dual Evaluation (Offline + Online)
also known as Offline+Online Eval Bands, Pre-Deploy + Post-Deploy Eval
Run two parallel evaluation tracks — offline benchmark gates before deploy AND online production-traffic monitoring after — so drift is caught even when pre-deploy benchmarks pass.
Context
A team evaluates agent quality. Common patterns: (a) offline eval only — benchmark before deploy, then nothing; (b) online monitoring only — react to production signal but cannot gate deploys.
Problem
Offline-only eval cannot catch drift between benchmark traffic and production traffic. Online-only eval cannot prevent bad deploys. Either alone misses failure modes the other catches.
Forces
- Two eval tracks means two infrastructures to maintain.
- Offline and online may disagree (different traffic shapes), creating triage burden.
- Online monitoring requires sampling and labeling discipline.
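One way to keep sampling disciplined is to make it deterministic, so the same request is always either in or out of the labeled sample. The sketch below is illustrative only: the function name `in_online_sample`, the use of a request ID as the sampling key, and the 5% rate are assumptions, not part of the pattern.

```python
import hashlib


def in_online_sample(request_id: str, rate: float) -> bool:
    """Decide deterministically whether a request joins the labeling sample.

    Hypothetical helper: hashes the request ID into [0, 1) and keeps the
    request when the value falls below the sampling rate.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, scaled into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Assumed rate for illustration: label roughly 5% of production requests.
SAMPLE_RATE = 0.05
```

Hash-based selection avoids the bias of hand-picked samples and lets a human reviewer and an LLM judge be pointed at exactly the same requests.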
Example
A support agent's offline eval is a benchmark of 200 hand-curated tickets. The current version passes 88% against a deploy gate of ≥85%, so the deploy goes ahead. The online track reports a rolling 7-day pass rate on a sample of production traffic, scored by LLM-as-judge with a weekly human spot-check. In week 2 the online rate drops to 81%. The tracks disagree: the offline benchmark had no coverage of a new traffic class (account-merging questions) that appears in production. The benchmark is refreshed to include the new class.
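The arithmetic in this example can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the names `offline_gate_passes` and `RollingPassRate` are hypothetical, and the per-day counts a caller feeds in would come from whatever labeling process the team uses.

```python
from collections import deque
from typing import Deque, Optional, Tuple

OFFLINE_GATE = 0.85  # deploy gate from the example (>= 85%)


def offline_gate_passes(passed: int, total: int, gate: float = OFFLINE_GATE) -> bool:
    """Return True when the benchmark pass rate meets the deploy gate."""
    if total <= 0:
        return False  # an empty benchmark cannot justify a deploy
    return passed / total >= gate


class RollingPassRate:
    """Pass rate over the most recent `window_days` days of labeled samples."""

    def __init__(self, window_days: int = 7) -> None:
        # Each entry is (passed, sampled) for one day; old days fall off.
        self._days: Deque[Tuple[int, int]] = deque(maxlen=window_days)

    def add_day(self, passed: int, sampled: int) -> None:
        self._days.append((passed, sampled))

    def rate(self) -> Optional[float]:
        sampled = sum(s for _, s in self._days)
        if sampled == 0:
            return None  # no labeled traffic yet
        return sum(p for p, _ in self._days) / sampled
```

With the example's numbers, 88% of 200 tickets is 176 passes, so `offline_gate_passes(176, 200)` clears the gate, while a window whose days average 81 passes per 100 sampled requests yields a rolling rate of 0.81.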
Solution
Therefore:
Run both tracks, each with its own threshold:
- Offline track: a curated benchmark suite that runs pre-deploy and gates rollout on its score.
- Online track: sampled production traffic with delayed labeling (human review, LLM-as-judge), feeding rolling metrics with alerting.
Disagreement between an offline pass and an online regression is itself a signal: it indicates a gap between the benchmark and production traffic. Pair with eval-harness, artifact-evaluation, shadow-canary, and scorer-live-monitoring.
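The idea that disagreement is a signal in its own right can be sketched as a small classifier over the two tracks. Everything here is an assumption for illustration: the `TrackSignal` names, the `compare_tracks` function, and the online alert threshold of 0.85 (the pattern gives a figure only for the offline gate in its example).

```python
from enum import Enum
from typing import Optional


class TrackSignal(Enum):
    BLOCKED = "blocked"                    # offline gate failed; do not deploy
    AWAITING_ONLINE = "awaiting_online"    # deployed, no labeled traffic yet
    BENCHMARK_GAP = "benchmark_gap"        # offline passed, online regressed
    HEALTHY = "healthy"                    # both tracks within thresholds


def compare_tracks(
    offline_score: float,
    online_rate: Optional[float],
    offline_gate: float = 0.85,   # gate value taken from the example
    online_alert: float = 0.85,   # assumed alert threshold, not from the source
) -> TrackSignal:
    """Classify the combined state of the offline and online tracks."""
    if offline_score < offline_gate:
        return TrackSignal.BLOCKED
    if online_rate is None:
        return TrackSignal.AWAITING_ONLINE
    if online_rate < online_alert:
        # The benchmark passed but production did not: the benchmark
        # likely misses traffic that production actually sees.
        return TrackSignal.BENCHMARK_GAP
    return TrackSignal.HEALTHY
```

Treating `BENCHMARK_GAP` as its own outcome, rather than folding it into a generic alert, points triage at refreshing the benchmark instead of only rolling back.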
What this pattern forbids. No deploy without a passing offline gate, and no live system without online monitoring; both tracks must have defined thresholds and alerting.
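These prohibitions can be enforced as a pre-deploy check that lists every reason a rollout must not proceed. The sketch below is one possible shape under stated assumptions: `TrackConfig`, its `alert_target` field, and `deploy_blockers` are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TrackConfig:
    threshold: Optional[float]   # minimum acceptable pass rate, if defined
    alert_target: Optional[str]  # hypothetical: where this track's alerts go


def deploy_blockers(
    offline: TrackConfig,
    online: TrackConfig,
    offline_score: Optional[float],
) -> List[str]:
    """Return the reasons a deploy is forbidden; an empty list means go."""
    blockers: List[str] = []
    for name, track in (("offline", offline), ("online", online)):
        if track.threshold is None:
            blockers.append(f"{name} track has no threshold")
        if not track.alert_target:
            blockers.append(f"{name} track has no alerting")
    if offline_score is None:
        blockers.append("offline gate has not been run")
    elif offline.threshold is not None and offline_score < offline.threshold:
        blockers.append("offline gate failed")
    return blockers
```

Returning a list of reasons rather than a single boolean makes a blocked deploy easier to act on: a missing online threshold and a failed offline gate need different fixes.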
And the patterns that stand alongside it, or against it —
- complements Eval Harness ★★: Run a held-out dataset against agent versions to detect regressions and measure improvement.
- complements Shadow Canary ★★: Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.
- complements Scorer Live Monitoring ★: Score agent outputs asynchronously in production with non-blocking scorers that observe, alert, and log but do not regenerate the output.
- complements Intermediate Artifact Evaluation ★: Evaluate intermediate artifacts (plans, tool-call traces, guardrail reactions), not only final outputs; isolates failure to a specific pipeline node.
- complements Agent Evaluator ★: A dedicated agent or harness whose sole job is running tests against another agent's outputs to evaluate performance; distinct from eval-harness (offline batch) and llm-as-judge (per-output).