X · Governance & Observability · Emerging

Dual Evaluation (Offline + Online)

also known as Offline+Online Eval Bands, Pre-Deploy + Post-Deploy Eval

Run two parallel evaluation tracks — offline benchmark gates before deploy AND online production-traffic monitoring after — so drift is caught even when pre-deploy benchmarks pass.

Context

A team evaluates agent quality. Common patterns: (a) offline eval only — benchmark before deploy, then nothing; (b) online monitoring only — react to production signal but cannot gate deploys.

Problem

Offline-only eval cannot catch drift between benchmark traffic and production traffic. Online-only eval cannot prevent bad deploys. Either alone misses failure modes the other catches.

Forces

  • Two eval tracks means two infrastructures to maintain.
  • Offline and online may disagree (different traffic shapes), creating triage burden.
  • Online monitoring requires sampling and labeling discipline.

Example

A support agent's offline eval is 200 hand-curated tickets with a deploy gate of ≥85%. The candidate scores 88% and passes. Online, the team tracks a rolling 7-day pass rate on a production sample (LLM-as-judge + weekly human spot-check). In week 2 the online rate drops to 81%. The tracks disagree: the offline benchmark lacks a traffic class (account-merging questions) that production now receives. The benchmark is refreshed to include the new class.
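The numbers in this example can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `offline_gate` and `RollingPassRate` are hypothetical names, and the online alert threshold of 85% is an assumption, since the example states a threshold only for the offline gate.

```python
from collections import deque
from dataclasses import dataclass, field

OFFLINE_GATE = 0.85  # from the example: deploy gate is >= 85%
ONLINE_ALERT = 0.85  # assumption: the example gives no online threshold


def offline_gate(passed: int, total: int, threshold: float = OFFLINE_GATE) -> bool:
    """Return True when the benchmark pass rate meets the deploy gate."""
    if total <= 0:
        raise ValueError("benchmark must contain at least one case")
    return passed / total >= threshold


@dataclass
class RollingPassRate:
    """Pass rate over the most recent `window_days` days of labeled samples."""
    window_days: int = 7
    _days: deque = field(default_factory=deque)

    def add_day(self, passed: int, sampled: int) -> None:
        self._days.append((passed, sampled))
        # Drop days that have fallen out of the rolling window.
        while len(self._days) > self.window_days:
            self._days.popleft()

    def rate(self) -> float:
        sampled = sum(s for _, s in self._days)
        if sampled == 0:
            raise ValueError("no labeled samples in the window")
        return sum(p for p, _ in self._days) / sampled

    def should_alert(self, threshold: float = ONLINE_ALERT) -> bool:
        return self.rate() < threshold


# 176 of 200 curated tickets pass (88%), so the gate opens.
deploy_allowed = offline_gate(176, 200)

# Week 2: assume 100 labeled production samples per day, 81 passing.
monitor = RollingPassRate()
for _ in range(7):
    monitor.add_day(passed=81, sampled=100)
online_regressed = monitor.should_alert()
```

With these inputs the gate passes while the online monitor alerts, which is the disagreement the example describes.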

Solution

Therefore:

  • Offline track: a curated benchmark suite that runs pre-deploy and gates rollout on score.
  • Online track: production traffic sampling with delayed labeling (human review, LLM-as-judge); rolling metrics with alerting.
  • Disagreement between an offline pass and an online regression is itself a signal: it indicates a benchmark-vs-production gap.

Pair with eval-harness, artifact-evaluation, shadow-canary, scorer-live-monitoring.
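One way to treat track disagreement as a signal is to classify the pair of scores and, when the offline track passes but the online track regresses, look for production traffic classes the benchmark does not cover. The sketch below assumes each benchmark case and each sampled production request carries a traffic-class label; `classify`, `uncovered_classes`, the thresholds, and the `billing` and `login` labels are all hypothetical.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class TrackState(Enum):
    HEALTHY = "offline pass, online healthy"
    BENCHMARK_GAP = "offline pass, online regression"
    GATE_BLOCKED = "offline fail"


def classify(offline_score: float, online_score: float,
             offline_gate: float = 0.85, online_floor: float = 0.85) -> TrackState:
    """Combine both tracks into one state; thresholds here are assumptions."""
    if offline_score < offline_gate:
        return TrackState.GATE_BLOCKED
    if online_score < online_floor:
        # The benchmark passed but production disagrees.
        return TrackState.BENCHMARK_GAP
    return TrackState.HEALTHY


def uncovered_classes(benchmark_labels: Iterable[str],
                      production_labels: Iterable[str],
                      min_share: float = 0.02) -> Dict[str, float]:
    """Production classes absent from the benchmark, with their traffic share."""
    covered = set(benchmark_labels)
    counts = Counter(production_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        label: n / total
        for label, n in counts.items()
        if label not in covered and n / total >= min_share
    }


state = classify(offline_score=0.88, online_score=0.81)
gaps = uncovered_classes(
    benchmark_labels=["billing", "login"],
    production_labels=["billing"] * 60 + ["login"] * 30 + ["account-merging"] * 10,
)
```

The `min_share` cutoff is a design choice to keep one-off requests from triggering a benchmark refresh; a team may prefer to review every unseen class instead.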

What this pattern forbids. No deploy without offline gate pass AND no live system without online monitoring; both tracks have defined thresholds and alerting.
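The rule above could be enforced as a pre-deploy check along these lines. This is a sketch under assumptions: the `Track` record and `deploy_blockers` function are hypothetical, and how thresholds and alerting are actually configured will depend on the team's tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Track:
    threshold: Optional[float] = None   # minimum acceptable pass rate
    alerting: bool = False              # whether a breach notifies someone
    last_score: Optional[float] = None  # most recent measured pass rate


def deploy_blockers(offline: Track, online: Track) -> List[str]:
    """List every reason the pattern would forbid this deploy."""
    blockers: List[str] = []
    for name, track in (("offline", offline), ("online", online)):
        if track.threshold is None:
            blockers.append(f"{name} track has no defined threshold")
        if not track.alerting:
            blockers.append(f"{name} track has no alerting")
    # The offline gate must have run and passed for this candidate.
    if offline.threshold is not None:
        if offline.last_score is None:
            blockers.append("offline gate has not been run")
        elif offline.last_score < offline.threshold:
            blockers.append("offline gate failed")
    return blockers


ready = deploy_blockers(
    offline=Track(threshold=0.85, alerting=True, last_score=0.88),
    online=Track(threshold=0.85, alerting=True),
)
not_ready = deploy_blockers(
    offline=Track(threshold=0.85, alerting=True, last_score=0.88),
    online=Track(),  # online monitoring was never configured
)
```

An empty list means the deploy may proceed; returning all reasons at once, instead of failing on the first, keeps the triage step short.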

And the patterns that stand alongside it, or against it —

  • complements · Eval Harness ★★ · Run a held-out dataset against agent versions to detect regressions and measure improvement.
  • complements · Shadow Canary ★★ · Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.
  • complements · Scorer Live Monitoring · Score agent outputs asynchronously in production with non-blocking scorers that observe, alert, and log but do not regenerate the output.
  • complements · Intermediate Artifact Evaluation · Evaluate intermediate artifacts (plans, tool-call traces, guardrail reactions), not only final outputs; isolates failure to a specific pipeline node.
  • complements · Agent Evaluator · A dedicated agent or harness whose sole job is running tests against another agent's outputs to evaluate performance; distinct from eval-harness (offline batch) and llm-as-judge (per-output).
