Governance & Observability

Intermediate Artifact Evaluation

Evaluate intermediate artifacts (plans, tool-call traces, guardrail reactions), not only final outputs, so that a failure can be isolated to a specific pipeline node.

Problem

Evaluating only the final output is coarse: it shows that something failed but not where. When a pipeline has many nodes (plan, tools, guardrails, reflection), the team cannot improve any specific node without per-node signal. This pattern differs from eval-harness (full-run eval) and eval-as-contract (boundary contract).

Solution

Each pipeline node emits a named artifact (plan, tool-call trace, guardrail decision, reflection output). The eval suite defines a rubric for each artifact type, and the per-artifact pass/fail rates show which node to improve. Pair with eval-harness, eval-as-contract, llm-as-judge, agent-evaluator, and dual-evaluation-offline-online.
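The idea of per-artifact rubrics and per-node pass rates can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Artifact` type, the node names, the three rubric functions, and the sample data are all hypothetical, and a real suite might replace the simple checks with an llm-as-judge call.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Artifact:
    """One named artifact emitted by one pipeline node during one run."""
    run_id: str
    node: str      # hypothetical node names: "plan", "tool_trace", "guardrail"
    payload: Any


Rubric = Callable[[Any], bool]


def plan_rubric(payload: Any) -> bool:
    # Assumed rubric: a plan passes if it is a non-empty list of steps.
    return isinstance(payload, list) and len(payload) > 0


def tool_trace_rubric(payload: Any) -> bool:
    # Assumed rubric: a trace passes if every tool call reported "ok".
    return all(call.get("status") == "ok" for call in payload)


def guardrail_rubric(payload: Any) -> bool:
    # Assumed rubric: the guardrail decision matches the labelled expectation.
    return payload.get("decision") == payload.get("expected")


RUBRICS: Dict[str, Rubric] = {
    "plan": plan_rubric,
    "tool_trace": tool_trace_rubric,
    "guardrail": guardrail_rubric,
}


def pass_rates(artifacts: List[Artifact], rubrics: Dict[str, Rubric]) -> Dict[str, float]:
    """Return the fraction of artifacts that pass their rubric, per node."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for artifact in artifacts:
        rubric = rubrics.get(artifact.node)
        if rubric is None:
            continue  # no rubric defined for this node; skip it
        total[artifact.node] += 1
        if rubric(artifact.payload):
            passed[artifact.node] += 1
    return {node: passed[node] / total[node] for node in total}


def weakest_node(rates: Dict[str, float]) -> str:
    """Return the node with the lowest pass rate, i.e. the first candidate to improve."""
    return min(rates, key=rates.get)


# Illustrative data for two runs; the second run has a failing tool call.
ARTIFACTS = [
    Artifact("run-1", "plan", ["search", "summarize"]),
    Artifact("run-1", "tool_trace", [{"tool": "search", "status": "ok"}]),
    Artifact("run-1", "guardrail", {"decision": "allow", "expected": "allow"}),
    Artifact("run-2", "plan", ["search"]),
    Artifact("run-2", "tool_trace", [{"tool": "search", "status": "error"}]),
    Artifact("run-2", "guardrail", {"decision": "allow", "expected": "allow"}),
]

RATES = pass_rates(ARTIFACTS, RUBRICS)
```

With this sample data, `weakest_node(RATES)` points at the tool-call node, which is the kind of attribution a final-output-only score cannot give. Keying rubrics by node name keeps each rubric independent, so a node can be re-instrumented without touching the others.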

When to use

  • Multi-node pipelines where failure attribution matters.
  • The team has engineering capacity for per-node instrumentation and rubrics.
  • Improvement work benefits from targeted node-level signal.

