Dual Evaluation (Offline + Online)

also known as Offline+Online Eval Bands, Pre-Deploy + Post-Deploy Eval

Run two parallel evaluation tracks — offline benchmark gates before deploy AND online production-traffic monitoring after — so drift is caught even when pre-deploy benchmarks pass.

Context

A team evaluates agent quality. Common patterns: (a) offline eval only — benchmark before deploy, then nothing; (b) online monitoring only — react to production signal but cannot gate deploys.

Problem

Offline-only eval cannot catch drift between benchmark traffic and production traffic. Online-only eval cannot prevent bad deploys. Either alone misses failure modes the other catches.

Forces

Two eval tracks means two infrastructures to maintain.
Offline and online may disagree (different traffic shapes), creating triage burden.
Online monitoring requires sampling and labeling discipline.
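
One way to get the sampling discipline named in the last force is a deterministic, hash-based sampler, so the same request is always in or out of the sample regardless of which worker sees it. This is a minimal sketch: the function name `should_sample`, the use of a request ID as the key, and the sample rate are illustrative assumptions, not part of the pattern.

```python
import hashlib


def should_sample(request_id: str, rate: float) -> bool:
    """Return True for a stable fraction (about `rate`) of request IDs.

    Hypothetical helper: the pattern does not prescribe a sampling method.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and keep the
    # request when it falls below rate * 2**64.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


if __name__ == "__main__":
    ids = [f"req-{i}" for i in range(10_000)]
    sampled = [rid for rid in ids if should_sample(rid, rate=0.05)]
    print(f"sampled {len(sampled)} of {len(ids)} requests for labeling")
```

Because the decision depends only on the ID, a request sampled at a low rate stays in the sample when the rate is raised, which keeps week-over-week comparisons stable.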

Example

A support agent's offline benchmark is 200 hand-curated tickets. The deploy gate is a pass rate of at least 85%; the build scores 88% and ships. The online track reports a rolling 7-day pass rate on a production sample, scored by an LLM-as-judge with a weekly human spot-check. In week 2 the online rate drops to 81%. The disagreement between the tracks points to the cause: production now contains a traffic class (account-merging questions) that the benchmark lacks. The benchmark is refreshed to include the new class.
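
The numbers in this example can be sketched as a gate check plus a rolling window. The 85% gate, the 88% score on 200 tickets, the 7-day window, and the 81% online rate come from the example; `ONLINE_FLOOR`, the per-day counts, and all function and class names are assumptions made for illustration, since the example does not state an online alert threshold.

```python
from collections import deque
from typing import Deque, Optional, Tuple

OFFLINE_GATE = 0.85  # from the example: deploy gate is >= 85%
ONLINE_FLOOR = 0.85  # assumption: the example gives no online alert threshold


def offline_gate_passes(passed: int, total: int, gate: float = OFFLINE_GATE) -> bool:
    """Return True when the benchmark pass rate meets the deploy gate."""
    if total <= 0:
        raise ValueError("benchmark must contain at least one case")
    return passed / total >= gate


class RollingPassRate:
    """Pass rate over the most recent `window_days` daily samples."""

    def __init__(self, window_days: int = 7) -> None:
        # deque(maxlen=...) drops the oldest day automatically.
        self._days: Deque[Tuple[int, int]] = deque(maxlen=window_days)

    def add_day(self, passed: int, total: int) -> None:
        self._days.append((passed, total))

    def rate(self) -> Optional[float]:
        total = sum(t for _, t in self._days)
        if total == 0:
            return None  # nothing judged yet
        return sum(p for p, _ in self._days) / total


if __name__ == "__main__":
    # Offline: 176 of 200 tickets pass (88%), so the gate opens.
    print("deploy allowed:", offline_gate_passes(176, 200))

    # Online: hypothetical daily samples of 100 judged conversations.
    monitor = RollingPassRate(window_days=7)
    for _ in range(7):
        monitor.add_day(passed=81, total=100)
    rate = monitor.rate()
    print("online alert:", rate is not None and rate < ONLINE_FLOOR)
```

Pooling counts across the window, instead of averaging daily rates, keeps low-traffic days from dominating the metric.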

Diagram

flowchart LR
    Code[New agent build] --> Off[Offline benchmark]
    Off -->|pass| Deploy[Deploy]
    Off -->|fail| Block[Block]
    Deploy --> Prod[Production traffic]
    Prod --> On[Online sampling + judging]
    On --> Alarm[Alert on regression]
    On --> Refresh[Refresh offline benchmark periodically]

Solution

Therefore:

Offline track: a curated benchmark suite that runs pre-deploy and gates rollout on its score.
Online track: production traffic sampling with delayed labeling (human review, LLM-as-judge), feeding rolling metrics with alerting.
Disagreement: an offline pass followed by an online regression is itself a signal; it indicates a gap between the benchmark and production traffic.
Pair with eval-harness, artifact-evaluation, shadow-canary, scorer-live-monitoring.
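
The disagreement signal can be made explicit in code. The sketch below assumes each benchmark case and each sampled production request carries a traffic-class label; the function names, the status strings, and the `min_share` cutoff are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def compare_tracks(offline_passed: bool, online_rate: float, online_floor: float) -> str:
    """Classify the combined state of the two tracks."""
    online_ok = online_rate >= online_floor
    if offline_passed and online_ok:
        return "agree_pass"
    if offline_passed and not online_ok:
        # Offline pass with online regression: likely benchmark-vs-production gap.
        return "benchmark_gap"
    if not offline_passed and online_ok:
        return "offline_only_fail"
    return "agree_fail"


def uncovered_classes(
    benchmark_labels: Iterable[str],
    production_labels: Iterable[str],
    min_share: float = 0.02,  # assumption: ignore classes under 2% of traffic
) -> Dict[str, float]:
    """Return production traffic classes absent from the benchmark, with their share."""
    covered = set(benchmark_labels)
    counts = Counter(production_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        label: n / total
        for label, n in counts.items()
        if label not in covered and n / total >= min_share
    }


if __name__ == "__main__":
    status = compare_tracks(offline_passed=True, online_rate=0.81, online_floor=0.85)
    print(status)
    if status == "benchmark_gap":
        benchmark = ["billing", "refund", "login"]
        production = ["billing"] * 60 + ["refund"] * 30 + ["account_merge"] * 10
        print(uncovered_classes(benchmark, production))
```

Listing the uncovered classes turns the triage step into a concrete to-do: add cases for those classes when the benchmark is next refreshed.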

What this pattern forbids. Deploying without a passing offline gate, and running a live system without online monitoring. Both tracks must have defined thresholds and alerting.
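
These prohibitions can be checked mechanically before a release. The sketch below is one possible shape for such a check; `TrackConfig`, its fields, and `policy_violations` are hypothetical names, and the alert targets shown are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TrackConfig:
    """Hypothetical description of one evaluation track."""
    threshold: Optional[float] = None   # pass-rate threshold for this track
    alert_target: Optional[str] = None  # where alerts go, e.g. a channel name


def policy_violations(
    offline: Optional[TrackConfig],
    online: Optional[TrackConfig],
    offline_gate_passed: bool,
) -> List[str]:
    """Return every way the release breaks the dual-evaluation rules."""
    problems: List[str] = []
    for name, track in (("offline", offline), ("online", online)):
        if track is None:
            problems.append(f"{name} track is missing")
            continue
        if track.threshold is None:
            problems.append(f"{name} track has no threshold")
        if not track.alert_target:
            problems.append(f"{name} track has no alerting")
    if not offline_gate_passed:
        problems.append("offline gate has not passed")
    return problems


if __name__ == "__main__":
    offline = TrackConfig(threshold=0.85, alert_target="#eval-offline")
    online = TrackConfig(threshold=0.85)  # alerting not configured
    for problem in policy_violations(offline, online, offline_gate_passed=True):
        print("blocked:", problem)
```

Returning the full list of violations, instead of failing on the first, lets a release checklist show everything that needs fixing at once.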

And the patterns that stand alongside it, or against it:

complements Eval Harness ★★: Run a held-out dataset against agent versions to detect regressions and measure improvement.
complements Shadow Canary ★★: Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.
complements Scorer Live Monitoring ★: Score agent outputs asynchronously in production with non-blocking scorers that observe, alert, and log but do not regenerate the output.
complements Intermediate Artifact Evaluation ★: Evaluate intermediate artifacts (plans, tool-call traces, guardrail reactions), not only final outputs; isolates failure to a specific pipeline node.
complements Agent Evaluator ★: A dedicated agent or harness whose sole job is running tests against another agent's outputs to evaluate performance; distinct from eval-harness (offline batch) and llm-as-judge (per-output).
complements Compliance-Certified Launch Gate ★: Require an external regulator to certify the generative service against a published content-safety standard before it may serve the public, forcing the standard's controls into the build as a re-certifiable artifact.

References

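A collection of AI agent design principles and implementation patterns worth rereading at the start of 2025 (original Japanese title: 2025年の年始に読み直したいAIエージェントの設計原則とか実装パターン集)
blog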

Provenance

Source: patterns/dual-evaluation-offline-online.md on GitHub · commit 0f962e5
Added to catalog: 2026-05-23
Last updated: 2026-05-23