X · Governance & Observability · Emerging

Intermediate Artifact Evaluation

also known as Per-Pipeline-Node Eval, Mid-Pipeline Artifact Eval

Evaluate intermediate artifacts (plans, tool-call traces, guardrail reactions), not only final outputs, so that a failure can be isolated to a specific pipeline node.

Context

A team evaluates agent quality by measuring final-output success. When the output is wrong, final-output eval cannot tell which pipeline node failed, so debugging requires manual trace inspection.

Problem

Final-output-only eval is coarse: it indicates that something failed but not where. When a pipeline has many nodes (plan, tools, guardrails, reflection), the team cannot improve any specific node without a per-node signal. This pattern differs from eval-harness (full-run eval) and eval-as-contract (boundary contract) in that it scores the artifacts produced between nodes.

Forces

  • Per-artifact eval requires instrumenting each pipeline node to emit reviewable artifacts.
  • More eval points means more eval cost (LLM-as-judge calls, human review time).
  • Some intermediate artifacts are not naturally evaluable in isolation.

Example

A research agent's eval. Before the pattern, only 'did the report answer the question?' is measured (78% pass). After the pattern, the planning artifact, tool-call trace, citation attribution, and final report are each evaluated: planning 88%, tools 92%, citations 65%, report 78%. Citations is the worst-scoring node, so improvement work targets it first instead of guessing.
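The aggregation behind such a breakdown can be sketched as follows. This is a minimal illustration, assuming each eval result is stored as a `(run_id, artifact, passed)` tuple; the record layout, the artifact names, and the sample values are hypothetical and are not the data behind the percentages above.

```python
from collections import defaultdict

# Hypothetical eval records: (run_id, artifact_name, passed).
# Illustrative values only; they do not reproduce the figures in the example.
records = [
    ("run-1", "plan", True), ("run-1", "tool_trace", True),
    ("run-1", "citations", False), ("run-1", "report", False),
    ("run-2", "plan", True), ("run-2", "tool_trace", True),
    ("run-2", "citations", True), ("run-2", "report", True),
    ("run-3", "plan", False), ("run-3", "tool_trace", True),
    ("run-3", "citations", False), ("run-3", "report", True),
]


def pass_rates(records):
    """Return the pass rate per artifact name across all runs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for _run_id, artifact, passed in records:
        totals[artifact] += 1
        passes[artifact] += int(passed)
    return {name: passes[name] / totals[name] for name in totals}


def worst_node(rates):
    """Return the artifact name with the lowest pass rate."""
    return min(rates, key=rates.get)


rates = pass_rates(records)
target = worst_node(rates)
print(f"improve first: {target} ({rates[target]:.0%} pass)")
```

With these sample records the citations artifact passes in one of three runs, so it is the one selected for improvement work.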

Solution

Therefore:

Each pipeline node emits a named artifact (plan, tool-call trace, guardrail decision, reflection output). The eval suite holds a rubric for each artifact class. Per-artifact pass/fail rates show which node to improve first. Pair with eval-harness, eval-as-contract, llm-as-judge, agent-evaluator, and dual-evaluation-offline-online.
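One way to picture the arrangement is a registry that maps each artifact class to its rubric. The sketch below is an assumption-laden illustration, not a prescribed implementation: the `Artifact` dataclass, the `RUBRICS` registry, and the node names are hypothetical, and the rubrics are simple deterministic checks standing in for whatever scorer (an LLM judge, a human reviewer) a team actually uses.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Artifact:
    node: str      # pipeline node that emitted the artifact (hypothetical names)
    name: str      # artifact class, e.g. "plan"
    payload: Any   # the reviewable content


Rubric = Callable[[Any], bool]

# Hypothetical per-artifact rubrics; real ones might call an LLM judge.
RUBRICS: Dict[str, Rubric] = {
    "plan": lambda steps: isinstance(steps, list) and len(steps) > 0,
    "tool_trace": lambda calls: all(c.get("status") == "ok" for c in calls),
    "guardrail_decision": lambda decision: decision in ("allow", "block"),
}


def evaluate_run(artifacts: List[Artifact]) -> Dict[str, bool]:
    """Score every artifact from one run with the rubric for its class."""
    results: Dict[str, bool] = {}
    for artifact in artifacts:
        rubric = RUBRICS.get(artifact.name)
        if rubric is None:
            raise KeyError(f"no rubric registered for {artifact.name!r}")
        results[artifact.name] = rubric(artifact.payload)
    return results


run = [
    Artifact("planner", "plan", ["search", "summarise"]),
    Artifact("executor", "tool_trace", [
        {"tool": "search", "status": "ok"},
        {"tool": "fetch", "status": "error"},
    ]),
    Artifact("guardrail", "guardrail_decision", "allow"),
]
verdicts = evaluate_run(run)
print(verdicts)
```

In this run the plan and the guardrail decision pass while the tool-call trace fails, which localises the failure to the executor node without reading the whole trace by hand.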

What this pattern forbids. Nodes that emit nothing reviewable, and artifact classes with no rubric: every pipeline node must emit a named, schema-defined artifact, and an eval rubric must exist for each artifact class.
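The two constraints can be checked mechanically. The following is a rough sketch under stated assumptions: schemas are modelled as a plain mapping from field name to expected Python type, and `SCHEMAS`, `RUBRIC_NAMES`, and the field names are all hypothetical. A real system might use a schema library instead.

```python
from typing import Any, Dict, List, Set

# Hypothetical schemas: artifact class -> required fields and their types.
SCHEMAS: Dict[str, Dict[str, type]] = {
    "plan": {"steps": list},
    "tool_trace": {"calls": list},
    "guardrail_decision": {"action": str, "reason": str},
}

# Hypothetical set of artifact classes that currently have a rubric.
RUBRIC_NAMES: Set[str] = {"plan", "tool_trace"}


def schema_errors(name: str, payload: Dict[str, Any]) -> List[str]:
    """List the ways a payload violates the schema for its artifact class."""
    schema = SCHEMAS.get(name)
    if schema is None:
        return [f"unknown artifact class: {name}"]
    errors: List[str] = []
    for field, expected in schema.items():
        if field not in payload:
            errors.append(f"{name}: missing field '{field}'")
        elif not isinstance(payload[field], expected):
            errors.append(f"{name}: field '{field}' should be {expected.__name__}")
    return errors


def classes_without_rubric() -> List[str]:
    """Artifact classes that have a schema but no eval rubric."""
    return sorted(set(SCHEMAS) - RUBRIC_NAMES)


print(schema_errors("guardrail_decision", {"action": "block"}))
print(classes_without_rubric())
```

Running both checks in CI is one possible way to keep a new node from shipping without a schema or a rubric; here the guardrail decision is flagged on both counts.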

And the patterns that stand alongside it, or against it —

  • complements · Eval Harness ★★ · Run a held-out dataset against agent versions to detect regressions and measure improvement.
  • complements · Eval as Contract ★★ · Treat the eval suite as the contract the agent must satisfy; releases ship only if evals pass.
  • complements · LLM-as-Judge ★★ · Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
  • complements · Agent Evaluator · A dedicated agent or harness whose sole job is running tests against another agent's outputs to evaluate performance; distinct from eval-harness (offline batch) and llm-as-judge (per-output).
  • complements · Dual Evaluation (Offline + Online) · Run two parallel evaluation tracks (offline benchmark gates before deploy and online production-traffic monitoring after), so drift is caught even when pre-deploy benchmarks pass.
