Governance & Observability

Scorer Live Monitoring

Score agent outputs asynchronously in production with non-blocking scorers that observe, alert, and log but do not regenerate the output.

Problem

Pre-release evaluations on a fixed held-out dataset cover only the distributions the team anticipated and say nothing about what real traffic looks like today. Closed-loop approaches that score every output inline and re-run the model whenever the score is low can double latency and cost on each request, even though most outputs are fine. The team is forced to choose among flying blind on live quality, paying the latency tax of inline scoring, and running expensive batch analyses long after the bad reply has reached the user.

Solution

After the agent returns to the user, publish `{request_id, input, output, context}` to a scoring stream. Independent scorer workers consume the stream and emit `{request_id, scorer, score, evidence}` records. Scorers may be LLM judges, programmatic checks, embedding similarity to a reference, or rubric checks. Aggregate scores into dashboards and alert rules; route low scores into a re-evaluation queue rather than triggering re-generation in the user's request path. This pattern is distinct from evaluator-optimizer, which closes the loop by re-prompting on failure, and from eval-harness, which scores a fixed set rather than live traffic.
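
The flow above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: an in-process `asyncio.Queue` stands in for the scoring stream (a real deployment would more likely use a message broker), and `Event`, `ScoreRecord`, `length_scorer`, `handle_request`, and the 0.5 threshold are all hypothetical names and values chosen for the example.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Event:
    request_id: str
    input: str
    output: str
    context: dict = field(default_factory=dict)


@dataclass
class ScoreRecord:
    request_id: str
    scorer: str
    score: float
    evidence: str


def length_scorer(event: Event) -> ScoreRecord:
    # Hypothetical programmatic check: flag empty or very short replies.
    n = len(event.output.split())
    return ScoreRecord(event.request_id, "min_length", 1.0 if n >= 3 else 0.0, f"{n} words")


async def scorer_worker(stream, records, reeval_queue, scorers, threshold=0.5):
    # Consumes events independently of the user request path.
    while True:
        event = await stream.get()
        try:
            for scorer in scorers:
                record = scorer(event)
                records.append(record)  # stand-in for a metrics/log sink
                if record.score < threshold:
                    # Route to re-evaluation; never regenerate the reply here.
                    await reeval_queue.put(record)
        finally:
            stream.task_done()


async def handle_request(request_id: str, user_input: str, stream: asyncio.Queue) -> str:
    # Stand-in for the agent call (assumption: a canned reply).
    output = "ok" if "short" in user_input else "Here is a full answer."
    # put_nowait on an unbounded queue does not wait, so the reply is not delayed.
    stream.put_nowait(Event(request_id, user_input, output))
    return output


async def demo():
    stream, reeval_queue, records = asyncio.Queue(), asyncio.Queue(), []
    worker = asyncio.create_task(
        scorer_worker(stream, records, reeval_queue, [length_scorer])
    )
    replies = [
        await handle_request("r1", "normal question", stream),
        await handle_request("r2", "short question", stream),
    ]
    await stream.join()  # demo only: wait until every event has been scored
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    flagged = []
    while not reeval_queue.empty():
        flagged.append(reeval_queue.get_nowait().request_id)
    return replies, records, flagged


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Both replies are returned before any scoring happens; the low-scoring one only shows up later as an entry in the re-evaluation queue, which is the property that separates this pattern from a closed-loop evaluator.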

When to use

  • Production quality must be observed continuously, not just at release.
  • Latency budget on the user path does not allow a blocking judge call.
  • Multiple scorer kinds (LLM judge, programmatic check, embedding similarity) should run side by side.
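
Running several scorer kinds side by side mostly comes down to giving them one shared signature. The sketch below assumes each scorer takes an event dict and returns `{scorer, score, evidence}`. All names are hypothetical; the bag-of-words cosine is only a toy stand-in for a real embedding model, and the LLM judge is an injected callable so that no particular provider API is implied. Isolating scorer failures, as `score_event` does, is a design choice of this sketch rather than something the pattern prescribes.

```python
import math
import re
from collections import Counter
from typing import Callable


def _bow(text: str) -> Counter:
    # Toy "embedding": lower-cased token counts (assumption, not a real model).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def no_email_leak(event: dict) -> dict:
    # Programmatic check: the reply should not contain an email address.
    hit = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", event["output"])
    return {
        "scorer": "no_email_leak",
        "score": 0.0 if hit else 1.0,
        "evidence": hit.group(0) if hit else "no address found",
    }


def reference_similarity(event: dict) -> dict:
    # Similarity to a reference answer carried in the event context.
    ref = event["context"].get("reference", "")
    sim = cosine(_bow(event["output"]), _bow(ref))
    return {"scorer": "reference_similarity", "score": sim, "evidence": f"cosine={sim:.3f}"}


def make_llm_judge(judge: Callable[[str], float]) -> Callable[[dict], dict]:
    # `judge` is any callable mapping a prompt to a 0..1 rating (hypothetical).
    def llm_judge(event: dict) -> dict:
        prompt = (
            "Rate from 0 to 1 how well the reply answers the question.\n"
            f"Q: {event['input']}\nA: {event['output']}"
        )
        return {"scorer": "llm_judge", "score": float(judge(prompt)), "evidence": "judge rating"}
    return llm_judge


def score_event(event: dict, scorers: list) -> list:
    # One failing scorer must not stop the others from reporting.
    records = []
    for scorer in scorers:
        try:
            rec = scorer(event)
        except Exception as exc:
            name = getattr(scorer, "__name__", "unknown")
            rec = {"scorer": name, "score": None, "evidence": f"scorer error: {exc!r}"}
        records.append({"request_id": event["request_id"], **rec})
    return records


if __name__ == "__main__":
    event = {
        "request_id": "r1",
        "input": "How do I get a refund?",
        "output": "Contact bob@example.com for a refund",
        "context": {"reference": "Contact support for a refund"},
    }
    scorers = [no_email_leak, reference_similarity, make_llm_judge(lambda prompt: 0.9)]
    for record in score_event(event, scorers):
        print(record)
```

Because every record carries the scorer name, dashboards and alert rules can treat each scorer as its own series while still joining on `request_id`.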

Related