Methodology · Evaluation

Component Then Holistic Evaluation

Test an agent at two layers, per ability and end-to-end, so you catch bugs where they start and still surface the ones that only appear when abilities interact.

Description

Test an agent from the bottom up. First score each ability on its own: tools, planning, memory, and learning (if the agent has it). Then run full end-to-end scenarios to check the whole system, looking at consistency, coherence, and fabricated content. The per-ability tests catch problems close to where they start. The whole-system pass catches problems that appear only when abilities that each pass on their own interact badly. Skipping either layer is a known mistake: per-ability tests alone miss integration bugs, and whole-system tests alone cannot tell you which ability caused a bug.
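A toy illustration of an interaction bug, assuming a hypothetical weather tool and a hypothetical memory store (every name below is invented for the example). Each ability passes its own test, yet the end-to-end check fails because the two disagree about units.

```python
# Hypothetical toy agent: all names and behaviour are illustrative assumptions.

def weather_tool(city: str) -> dict:
    """Tool ability: returns a temperature in Fahrenheit."""
    return {"city": city, "temp": 68.0, "unit": "F"}


class Memory:
    """Memory ability: stores bare numbers and assumes they are Celsius."""

    def __init__(self) -> None:
        self._store: dict[str, float] = {}

    def write(self, key: str, value: float) -> None:
        self._store[key] = value

    def read_celsius(self, key: str) -> float:
        return self._store[key]


def agent_answer(city: str, memory: Memory) -> str:
    """End-to-end path: call the tool, cache the value, answer from memory."""
    reading = weather_tool(city)
    memory.write(city, reading["temp"])  # the unit is dropped here
    return f"{memory.read_celsius(city):.0f} C"


# Per-ability tests: both pass.
tool_ok = weather_tool("Oslo")["unit"] == "F"
mem = Memory()
mem.write("k", 20.0)
memory_ok = mem.read_celsius("k") == 20.0

# Whole-system test: 68 F is 20 C, so the correct answer is "20 C".
end_to_end_ok = agent_answer("Oslo", Memory()) == "20 C"

print(tool_ok, memory_ok, end_to_end_ok)  # prints: True True False
```

Neither per-ability test is wrong here. The defect sits in the hand-off between the two abilities, which only the end-to-end scenario exercises.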

When to apply

Use this for any agent with several distinct abilities, such as tool use, planning, memory, and possibly online learning. It pays off when debugging full runs is expensive and you can exercise each ability on its own. Don't apply it to single-ability agents, such as a pure summariser or a pure classifier, where there is nothing to break apart. One whole-system test is enough there.

What it involves

  • Decompose the agent into testable capabilities
  • Evaluate tools in isolation
  • Evaluate planning in isolation
  • Evaluate memory in isolation
  • Evaluate learning (if present) in isolation
  • Run end-to-end holistic scenarios
  • Diff component vs holistic results
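The steps above can be sketched as a small harness. This is a minimal sketch under stated assumptions, not a reference implementation: `Result`, `run_suite`, `diff_layers`, the 0.9 pass threshold, and the placeholder lambda checks are all hypothetical. Real suites would call the agent's actual tools, planner, and memory.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    name: str
    passed: int
    total: int

    @property
    def rate(self) -> float:
        # Pass rate in [0, 1]; an empty suite counts as 0.0.
        return self.passed / self.total if self.total else 0.0


def run_suite(name: str, cases: list[Callable[[], bool]]) -> Result:
    """Run zero-argument checks and count how many return True."""
    passed = sum(1 for case in cases if case())
    return Result(name, passed, len(cases))


def diff_layers(components: list[Result], holistic: Result,
                threshold: float = 0.9) -> str:
    """Compare the two layers; the threshold is an assumed cut-off."""
    weak = [c.name for c in components if c.rate < threshold]
    if weak:
        return "fix components first: " + ", ".join(weak)
    if holistic.rate < threshold:
        return "components pass but end-to-end fails: suspect an interaction"
    return "both layers pass"


# Placeholder checks stand in for real per-ability and end-to-end tests.
components = [
    run_suite("tools", [lambda: True, lambda: True]),
    run_suite("planning", [lambda: True]),
    run_suite("memory", [lambda: True, lambda: True]),
]
holistic = run_suite("end_to_end", [lambda: True, lambda: False])

report = diff_layers(components, holistic)
print(report)  # prints: components pass but end-to-end fails: suspect an interaction
```

Reporting weak components before looking at the holistic score reflects the bottom-up order: an end-to-end failure is only attributed to interaction once every ability has passed on its own.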
