Governance & Observability

Dual Evaluation (Offline + Online)

Run two parallel evaluation tracks: an offline benchmark gate before deploy and online production-traffic monitoring after it, so that drift is caught even when the pre-deploy benchmarks pass.

Problem

Offline-only eval cannot catch drift between benchmark traffic and production traffic. Online-only eval cannot prevent bad deploys. Either alone misses failure modes the other catches.

Solution

Offline track: a curated benchmark suite that runs pre-deploy and gates rollout on its score.

Online track: sampled production traffic with delayed labeling (human review or LLM-as-judge), aggregated into rolling metrics with alerting.

Disagreement between the two tracks, an offline pass followed by an online regression, is itself a signal: it indicates a gap between the benchmark and production traffic.

Pair with eval-harness, artifact-evaluation, shadow-canary, scorer-live-monitoring.
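The two tracks and the disagreement signal can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names offline_gate, OnlineMonitor and dual_eval_status, the thresholds (0.85 and 0.80), the window size and the minimum label count are all hypothetical, and scores are assumed to be floats in [0, 1] where higher is better.

```python
from collections import deque
from dataclasses import dataclass, field
from statistics import mean
from typing import Optional, Sequence


def offline_gate(benchmark_scores: Sequence[float], min_score: float = 0.85) -> bool:
    """Offline track: allow rollout only if the mean benchmark score clears the bar.

    min_score is an illustrative threshold, not a recommended value.
    """
    return mean(benchmark_scores) >= min_score


@dataclass
class OnlineMonitor:
    """Online track: rolling mean over delayed labels from sampled production traffic."""

    window: int = 200          # number of most recent labels to keep (assumed)
    alert_below: float = 0.80  # alert threshold on the rolling mean (assumed)
    min_labels: int = 20       # do not alert until this many labels have arrived
    _labels: deque = field(init=False, repr=False)

    def __post_init__(self) -> None:
        # A bounded deque drops the oldest label once the window is full.
        self._labels = deque(maxlen=self.window)

    def record(self, label_score: float) -> None:
        """Add one label (from human review or an LLM judge) when it arrives."""
        self._labels.append(label_score)

    def rolling_score(self) -> Optional[float]:
        """Return the rolling mean, or None while there are too few labels."""
        if len(self._labels) < self.min_labels:
            return None
        return mean(self._labels)

    def regressed(self) -> bool:
        score = self.rolling_score()
        return score is not None and score < self.alert_below


def dual_eval_status(offline_passed: bool, monitor: OnlineMonitor) -> str:
    """Combine both tracks into one status string."""
    if not offline_passed:
        return "blocked"        # the offline gate prevented the deploy
    if monitor.regressed():
        return "benchmark_gap"  # offline pass + online regression: the disagreement signal
    return "healthy"
```

In this sketch a caller would run offline_gate once per release candidate, then call record for each label as it arrives and poll dual_eval_status on a schedule. Treating "benchmark_gap" as a distinct status, rather than a generic alert, keeps the benchmark-vs-production disagreement visible as a prompt to refresh the benchmark suite.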

When to use

  • Production deployment of consequential agents.
  • Both pre-deploy gating and post-deploy monitoring matter.
  • Engineering capacity exists to build and maintain two eval infrastructures.

Related