Methodology · Evaluation

Component Then Holistic Evaluation

Test an agent at two layers, per ability and end-to-end, so you catch bugs where they start and still surface the ones that only appear when abilities interact.

Description

Test an agent from the bottom up. First score each ability on its own: tools, planning, memory, and learning. Then run full end-to-end scenarios to check the whole system. At that level you look for consistency, coherence, and made-up content. The per-ability tests catch problems close to where they start. The whole-system pass catches new problems that appear only when abilities that each pass on their own interact badly. Skipping either layer is a known mistake. Per-ability tests alone miss integration bugs. Whole-system tests alone cannot tell you which ability caused a bug.
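A minimal sketch of how the two layers combine into a diagnosis. The function name `triage`, the ability labels, and the result strings are hypothetical and chosen only for illustration. The point is that a failed full run can be attributed only when per-ability results are available alongside it.

```python
def triage(component_pass: dict[str, bool], holistic_pass: bool) -> str:
    """Combine per-ability results with the end-to-end result.

    component_pass maps an ability name (hypothetical labels such as
    "tools" or "planning") to whether its isolated tests passed.
    """
    failed = sorted(name for name, ok in component_pass.items() if not ok)
    if failed:
        # A per-ability test caught the problem close to where it starts.
        return "component bug: " + ", ".join(failed)
    if not holistic_pass:
        # Every ability passes alone, so the fault lies in how they interact.
        return "integration bug"
    return "ok"
```

With only the holistic result, the first two cases look identical. With only the component results, the second case is never seen.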

When to apply

Use this for any agent with several distinct abilities, such as tool use, planning, memory, and possibly online learning. It pays off when debugging full runs is expensive and you can exercise each ability on its own. Don't apply it to single-ability agents, such as a pure summariser or a pure classifier, where there is nothing to break apart. One whole-system test is enough there.

What it involves

Decompose the agent into testable capabilities
Evaluate tools in isolation
Evaluate planning in isolation
Evaluate memory in isolation
Evaluate learning (if present) in isolation
Run end-to-end holistic scenarios
Diff component vs holistic results
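The steps above can be sketched as a small harness. This is an illustration under stated assumptions, not a reference implementation: the `Scenario` class, the `evaluate` function, the 0.9 pass threshold, and the stubbed scores are all hypothetical. Each component suite is assumed to return a score between 0 and 1, and each scenario is assumed to declare which abilities it exercises.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    name: str
    uses: set[str]            # abilities this end-to-end scenario exercises
    run: Callable[[], bool]   # returns True if the full run passed


def evaluate(component_suites: dict[str, Callable[[], float]],
             scenarios: list[Scenario],
             threshold: float = 0.9) -> dict[str, list[str]]:
    # Score each ability in isolation (threshold is an assumed cut-off).
    scores = {name: suite() for name, suite in component_suites.items()}
    weak = {name for name, score in scores.items() if score < threshold}

    report: dict[str, list[str]] = {
        "weak_components": sorted(weak),
        "attributable": [],   # failed scenarios that touch a weak ability
        "integration": [],    # failed scenarios whose abilities all pass alone
    }
    # Run the holistic scenarios, then diff against the component results.
    for scenario in scenarios:
        if scenario.run():
            continue
        bucket = "attributable" if scenario.uses & weak else "integration"
        report[bucket].append(scenario.name)
    return report


# Stubbed suites and scenarios; the numbers are made up for the example.
suites = {"tools": lambda: 0.97, "planning": lambda: 0.95, "memory": lambda: 0.70}
scenarios = [
    Scenario("book_trip", {"tools", "planning"}, lambda: False),
    Scenario("recall_preference", {"memory"}, lambda: False),
    Scenario("simple_lookup", {"tools"}, lambda: True),
]
report = evaluate(suites, scenarios)
```

In this stubbed run, `recall_preference` lands in the attributable bucket because memory scored below the threshold, while `book_trip` lands in the integration bucket because tools and planning each pass alone. A real harness would replace the lambdas with actual test suites and scenario runners.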
