Intermediate Artifact Evaluation

also known as Per-Pipeline-Node Eval, Mid-Pipeline Artifact Eval

Evaluate intermediate artifacts (plans, tool-call traces, guardrail decisions), not only final outputs, so that a failure can be isolated to a specific pipeline node.

Context

A team evaluates agent quality by measuring final-output success. When the output is wrong, final-output eval cannot tell which pipeline node failed, so debugging requires manual trace inspection.

Problem

Final-output-only eval is coarse: it indicates that something failed but not where. When a pipeline has many nodes (plan, tools, guardrails, reflection), the team cannot improve any specific node without a per-node signal. This pattern differs from eval-harness (full-run eval) and eval-as-contract (boundary contract).

Forces

Per-artifact eval requires instrumenting each pipeline node to emit reviewable artifacts.
More eval points means more eval cost (LLM-as-judge calls, human review time).
Some intermediate artifacts are not naturally evaluable in isolation.

Example

A research agent's eval, before the pattern: only 'did the report answer the question?' is measured (78% pass). After the pattern: the planning artifact, tool-call trace, citation attribution, and final report are each evaluated. Planning scores 88%, tools 92%, citations 65%, report 78%. Citations is the worst-scoring node, so improvement work targets it first instead of guessing.
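The arithmetic behind this example can be sketched as follows. This is a minimal illustration, assuming 100 eval runs with hypothetical pass counts chosen to reproduce the percentages quoted above; `pass_rates` and `worst_node` are illustrative names, not part of any library.

```python
from collections import defaultdict

def pass_rates(results):
    """Aggregate (node, passed) pairs into a pass rate per node."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for node, passed in results:
        totals[node] += 1
        passes[node] += int(passed)
    return {node: passes[node] / totals[node] for node in totals}

def worst_node(rates):
    """Return the node with the lowest pass rate."""
    return min(rates, key=rates.get)

# Hypothetical data: 100 runs per node, pass counts matching the example.
counts = {"planning": 88, "tools": 92, "citations": 65, "report": 78}
results = [
    (node, i < n_pass)
    for node, n_pass in counts.items()
    for i in range(100)
]

rates = pass_rates(results)
print(rates)
print("improve first:", worst_node(rates))
```

With only the `report` row available, the 78% figure gives no hint that citations are the weak point; the per-node rows make that ranking explicit.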

Diagram

flowchart TD
    Pipe[Agent pipeline] --> N1[Node 1: plan]
    N1 --> A1[Artifact: plan]
    Pipe --> N2[Node 2: tool calls]
    N2 --> A2[Artifact: trace]
    Pipe --> N3[Node 3: guardrail]
    N3 --> A3[Artifact: decision]
    A1 --> E1[Per-artifact eval]
    A2 --> E2[Per-artifact eval]
    A3 --> E3[Per-artifact eval]
    E1 --> Score[Per-node scores]
    E2 --> Score
    E3 --> Score

Solution

Therefore:

Each pipeline node emits a named artifact (plan, tool-call trace, guardrail decision, reflection output). The eval suite holds one rubric per artifact class. Per-artifact pass/fail rates show which node to improve. Pair with eval-harness, eval-as-contract, llm-as-judge, agent-evaluator, dual-evaluation-offline-online.
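One way the solution could look in code is sketched below. Everything here is an assumption made for illustration: the `Artifact` class, the two toy nodes, and the deterministic rubric functions are hypothetical, and in practice a rubric might instead call an LLM judge or route to human review.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Artifact:
    name: str     # artifact class, e.g. "plan" or "tool_trace"
    payload: Any  # whatever the node produced

# Hypothetical rubrics: one per artifact class, each returning pass/fail.
def plan_rubric(payload: Any) -> bool:
    return isinstance(payload, list) and len(payload) > 0

def trace_rubric(payload: Any) -> bool:
    return all(call.get("status") == "ok" for call in payload)

RUBRICS: Dict[str, Callable[[Any], bool]] = {
    "plan": plan_rubric,
    "tool_trace": trace_rubric,
}

# Toy nodes: each returns a named artifact instead of a bare value.
def plan_node(question: str) -> Artifact:
    return Artifact("plan", ["search: " + question, "summarise findings"])

def tool_node(plan: Artifact) -> Artifact:
    trace = [{"tool": "search", "args": step, "status": "ok"}
             for step in plan.payload]
    return Artifact("tool_trace", trace)

def run_pipeline(question: str) -> List[Artifact]:
    plan = plan_node(question)
    trace = tool_node(plan)
    return [plan, trace]

def evaluate(artifacts: List[Artifact]) -> Dict[str, bool]:
    """Score each artifact with the rubric registered for its class."""
    return {a.name: RUBRICS[a.name](a.payload) for a in artifacts}

verdicts = evaluate(run_pipeline("What is intermediate artifact evaluation?"))
print(verdicts)
```

Keeping the rubrics in a registry keyed by artifact name means an artifact class without a rubric fails loudly with a `KeyError` instead of being skipped silently.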

What this pattern forbids. Pipeline nodes may not skip emitting a named, schema-defined artifact, and no artifact class may lack its own eval rubric.
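The constraint above could be enforced with a check like the following sketch. The schemas, field names, and helper functions are hypothetical, chosen only to show the two checks: that an artifact conforms to the schema for its class, and that every artifact class has a rubric.

```python
# Hypothetical schemas: required fields and their types per artifact class.
SCHEMAS = {
    "plan": {"steps": list},
    "tool_trace": {"calls": list},
    "guardrail_decision": {"allowed": bool, "reason": str},
}

def validate_artifact(name, payload):
    """Return a list of schema violations; an empty list means conforming."""
    if name not in SCHEMAS:
        return [f"unknown artifact class: {name}"]
    errors = []
    for field, expected in SCHEMAS[name].items():
        if field not in payload:
            errors.append(f"{name}: missing field '{field}'")
        elif not isinstance(payload[field], expected):
            errors.append(f"{name}: field '{field}' should be {expected.__name__}")
    return errors

def missing_rubrics(schemas, rubric_names):
    """Artifact classes that have a schema but no eval rubric."""
    return sorted(set(schemas) - set(rubric_names))

print(validate_artifact("guardrail_decision", {"allowed": True, "reason": "ok"}))
print(validate_artifact("plan", {}))
print(missing_rubrics(SCHEMAS, {"plan"}))
```

Running `missing_rubrics` as part of the eval suite's own startup is one way to keep the rubric set in step with the artifact classes as nodes are added.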

And the patterns that stand alongside it, or against it —

complements · Eval Harness ★★ · Run a held-out dataset against agent versions to detect regressions and measure improvement.
complements · Eval as Contract ★★ · Treat the eval suite as the contract the agent must satisfy; releases ship only if evals pass.
complements · LLM-as-Judge ★★ · Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
complements · Agent Evaluator ★ · A dedicated agent or harness whose sole job is running tests against another agent's outputs to evaluate performance; distinct from eval-harness (offline batch) and llm-as-judge (per-output).
complements · Dual Evaluation (Offline + Online) ★ · Run two parallel evaluation tracks (offline benchmark gates before deploy AND online production-traffic monitoring after) so drift is caught even when pre-deploy benchmarks pass.

Used in frameworks

Arize Phoenix / Arize AX
first-class · 4 patterns · Enterprise Platforms · ★★ mature
Arize evaluates intermediate agent artifacts — sending the ordered list of tool calls (the trajectory) to an LLM judge and scoring individual tool-calling spans — rather than only…

References

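A collection of AI agent design principles and implementation patterns worth rereading at the start of 2025 (original Japanese title: 2025年の年始に読み直したいAIエージェントの設計原則とか実装パターン集)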
blog

Provenance

Source: patterns/artifact-evaluation.md on GitHub · commit 0f962e5
Added to catalog: 2026-05-23
Last updated: 2026-05-23
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.