X · Governance & Observability · Emerging

Scorer Live Monitoring

also known as Live Evaluation, Production Scoring, Async Output Scorers

Score agent outputs asynchronously in production with non-blocking scorers that observe, alert, and log but do not regenerate the output.

Context

A team runs an agent that handles real user traffic and wants a continuous read on output quality, not just a snapshot at release time. The product has a tight latency budget — users will notice if every reply waits an extra second on a scoring model. Quality matters across several dimensions at once: helpfulness judged by another model, forbidden phrases checked programmatically, similarity to a curated reference, and rubric-based checks.

Problem

Pre-release evaluations on a fixed held-out dataset cover only the distributions the team thought of in advance and say nothing about what real traffic looks like today. Closed-loop approaches that score inline and re-run the model whenever a score is low add a judge call's latency and cost to every request, even though most outputs are fine. The team is forced to choose between flying blind on live quality, paying the latency tax of inline scoring, or running expensive batch analyses long after the bad reply has reached the user.

Forces

  • Live quality data is the only honest signal that production matches lab.
  • Blocking the response on a judge model adds a second model call to every request, roughly doubling latency and cost.
  • Async scorers can fall behind during traffic spikes and need back-pressure.
  • Open-loop scoring is informational only — the user already saw the output by the time the score lands.
  • Multiple scorer kinds (LLM judge, programmatic check, embedding-similarity, rubric) emit on different timescales.
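
The back-pressure force can be sketched as a bounded buffer between the request path and the scorer workers. This is a minimal illustration under stated assumptions, not a prescribed design: the class name `ScoringBuffer`, the `max_pending` parameter, and the drop-newest policy are all hypothetical choices. A real deployment might sample or spill to durable storage instead of dropping.

```python
import queue
from typing import Optional


class ScoringBuffer:
    """Bounded buffer between the request path and scorer workers.

    Hypothetical helper: when scorers fall behind, new events are
    dropped and counted rather than blocking the caller.
    """

    def __init__(self, max_pending: int) -> None:
        self._q: "queue.Queue[dict]" = queue.Queue(maxsize=max_pending)
        self.dropped = 0  # exposed so a dashboard can alert on lost coverage

    def offer(self, event: dict) -> bool:
        """Enqueue without blocking; return False if the event was dropped."""
        try:
            self._q.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def take(self) -> Optional[dict]:
        """Return the oldest pending event, or None if the buffer is empty."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None
```

Counting drops matters because a silent gap in scoring coverage during a traffic spike would otherwise look like stable quality.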

Example

A customer-support agent serves thousands of replies per hour. The team wants a continuous read on reply quality but cannot afford to block each reply on a judge model. They emit every reply onto a scoring stream that three workers consume: an LLM-judge scorer for helpfulness, a programmatic scorer for forbidden phrases, and an embedding-similarity scorer against a curated reference set. Scores populate a dashboard with rolling p50/p95; replies that score below threshold flow into a human-review queue, although the user has already seen them. When the LLM-judge score drops sharply after a model swap, the team rolls back before the next deploy.
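
The dashboard and review-queue steps in this example could look roughly like the following sketch. The names `RollingScores` and `route_for_review`, the window size, and the dict-shaped score record are assumptions made for illustration. The percentile calculation uses the standard library's `statistics.quantiles`.

```python
from collections import deque
from statistics import quantiles
from typing import Deque, List


class RollingScores:
    """Keeps the most recent scores and reports percentiles over them."""

    def __init__(self, window: int = 1000) -> None:
        # deque(maxlen=...) silently evicts the oldest score when full
        self._scores: Deque[float] = deque(maxlen=window)

    def add(self, score: float) -> None:
        self._scores.append(score)

    def percentile(self, p: int) -> float:
        if not 1 <= p <= 99:
            raise ValueError("p must be between 1 and 99")
        if len(self._scores) < 2:
            raise ValueError("need at least two scores")
        # n=100 yields 99 cut points; cut point p-1 is the p-th percentile
        cuts = quantiles(self._scores, n=100, method="inclusive")
        return cuts[p - 1]


def route_for_review(record: dict, threshold: float,
                     review_queue: List[dict]) -> bool:
    """Append a low-scoring record to the review queue; never touch the reply."""
    if record["score"] < threshold:
        review_queue.append(record)
        return True
    return False
```

A sharp fall in `percentile(50)` between two deploys is the kind of signal that would prompt the rollback described above.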

Solution

Therefore:

After the agent returns to the user, publish `{request_id, input, output, context}` to a scoring stream. Independent scorer workers consume the stream and emit `{request_id, scorer, score, evidence}` records. Scorers may be LLM judges, programmatic checks, embedding-similarity to a reference, or rubric checks. Aggregate scores into dashboards and alert rules; route low scores into a re-evaluation queue rather than triggering re-generation in the user's request path. This pattern is distinct from evaluator-optimizer, which closes the loop by re-prompting on failure, and from eval-harness, which scores a fixed set rather than live traffic.
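
The two record shapes named above can be sketched as plain data types with scorers as functions from one to the other. This is an illustrative sketch only: the field types, the `FORBIDDEN` phrase list, the `MAX_CHARS` limit, and both scorer functions are assumptions, and an LLM-judge or embedding scorer would fit the same signature.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class ScoringEvent:
    request_id: str
    input: str
    output: str
    context: str


@dataclass(frozen=True)
class ScoreRecord:
    request_id: str
    scorer: str
    score: float
    evidence: str


Scorer = Callable[[ScoringEvent], ScoreRecord]

FORBIDDEN = ("guaranteed refund", "legal advice")  # hypothetical phrase list
MAX_CHARS = 1200                                   # hypothetical rubric limit


def forbidden_phrase_scorer(event: ScoringEvent) -> ScoreRecord:
    """Programmatic check: 0.0 if any forbidden phrase appears, else 1.0."""
    hits = [p for p in FORBIDDEN if p in event.output.lower()]
    evidence = ", ".join(hits) if hits else "no forbidden phrases found"
    return ScoreRecord(event.request_id, "forbidden_phrases",
                       0.0 if hits else 1.0, evidence)


def length_rubric_scorer(event: ScoringEvent) -> ScoreRecord:
    """Rubric check: the reply must be non-empty and within MAX_CHARS."""
    n = len(event.output.strip())
    ok = 0 < n <= MAX_CHARS
    return ScoreRecord(event.request_id, "length_rubric",
                       1.0 if ok else 0.0, f"{n} characters")


def score_events(events: Iterable[ScoringEvent],
                 scorers: List[Scorer]) -> List[ScoreRecord]:
    """Run every scorer over every event; outputs are read, never modified."""
    return [scorer(event) for event in events for scorer in scorers]
```

Keeping `ScoringEvent` frozen makes the read-only relationship explicit: a scorer can describe an output but has no way to alter it.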

What this pattern forbids. Scorers do not run in the user's request path and may not modify or regenerate the agent's output; the user-visible response must not block on a scorer.
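
One way to honour this constraint is to have the request handler hand the event to a queue and return immediately, leaving scoring to a background task. The sketch below uses `asyncio` from the standard library; `handle_request`, `scorer_worker`, the echo reply, and the stub judge are hypothetical stand-ins for the real agent and scorer calls.

```python
import asyncio
from typing import List


async def scorer_worker(stream: asyncio.Queue, results: List[dict]) -> None:
    """Consume events forever; runs outside the user's request path."""
    while True:
        event = await stream.get()
        await asyncio.sleep(0.05)  # stand-in for a slow judge-model call
        results.append({
            "request_id": event["request_id"],
            "scorer": "judge_stub",
            "score": 1.0,
            "evidence": "stub",
        })
        stream.task_done()


async def handle_request(request_id: str, user_input: str,
                         stream: asyncio.Queue) -> str:
    output = f"echo: {user_input}"  # stand-in for the agent call
    try:
        # put_nowait never waits, so the reply cannot block on scoring
        stream.put_nowait({"request_id": request_id, "input": user_input,
                           "output": output, "context": ""})
    except asyncio.QueueFull:
        pass  # scoring coverage is lost, but the user is unaffected
    return output


async def demo():
    stream: asyncio.Queue = asyncio.Queue(maxsize=100)
    results: List[dict] = []
    worker = asyncio.create_task(scorer_worker(stream, results))
    reply = await handle_request("r1", "hi", stream)
    scored_at_reply_time = len(results)  # the score has not landed yet
    await stream.join()                  # wait for the scorer, for the demo only
    worker.cancel()
    return reply, scored_at_reply_time, len(results)
```

Running `asyncio.run(demo())` shows the reply being returned before any score exists, which is the open-loop property the pattern requires.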

The smaller patterns that complete this one —

  • uses LLM-as-Judge ★★: Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
  • uses Agent-as-a-Judge: Evaluate an agent's full trajectory (steps, tool calls, intermediate states) by another agent rather than scoring only the final output.

And the patterns that stand alongside it, or against it —

  • complements Eval Harness ★★: Run a held-out dataset against agent versions to detect regressions and measure improvement.
  • alternative-to Evaluator-Optimizer ★★: One LLM generates; another evaluates and feeds back; loop until criteria are met.
  • complements Shadow Canary ★★: Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.
  • complements Rigor Relocation: Relocate verification rigor from the model loop to surrounding scaffolding (evals, judges, decision logs, policy gates) so failures are caught by the wrapper rather than the agent.
  • complements Dual Evaluation (Offline + Online): Run two parallel evaluation tracks (offline benchmark gates before deploy and online production-traffic monitoring after) so drift is caught even when pre-deploy benchmarks pass.
