Component Then Holistic Evaluation
also known as bottom-up agent eval, capability-then-system eval
Test an agent from the bottom up. First score each ability on its own: tools, planning, memory, and learning. Then run full end-to-end scenarios to check the whole system. At that level you look for consistency, coherence, and made-up content. The per-ability tests catch problems close to where they start. The whole-system pass catches new problems that appear only when abilities that each pass on their own interact badly. Skipping either layer is a known mistake. Per-ability tests alone miss integration bugs. Whole-system tests alone cannot tell you which ability caused a bug.
Methodology process overview
Intent. Test an agent at two layers, per ability and end-to-end, so you catch bugs where they start and still surface the ones that only appear when abilities interact.
When to apply. Use this for any agent with several distinct abilities, such as tool use, planning, memory, and possibly online learning. It pays off when debugging full runs is expensive and you can exercise each ability on its own. Don't apply it to single-ability agents, such as a pure summariser or a pure classifier, where there is nothing to break apart. One whole-system test is enough there.
Inputs
- Capability inventory — A list of the agent's distinct abilities: its tools, its planning module, its memory store, and its learning loop.
- Per-capability eval sets — Inputs built to exercise one ability on its own, with the other abilities faked out by mocks or stubs.
- End-to-end scenarios — Realistic multi-step tasks that make several abilities work together.
Outputs
- Per-capability scorecards — Pass/fail or number scores for tools, planning, memory, and learning, each tested on its own.
- Holistic scorecard — End-to-end scores for consistency, coherence, and made-up content across realistic scenarios.
Steps (7)
Decompose the agent into testable capabilities
Split the agent into parts you can test on their own: picking and calling tools, planning quality, reading and writing memory correctly, and any learning loop.
Evaluate tools in isolation
Test each tool on its own with set inputs. Check the format is right, errors are handled, repeats are safe, and speed is fine. Fake the agent's reasoning so any failure points straight at the tool.
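A minimal sketch of these four checks, assuming a tool is a plain Python callable that returns a dict. The tool `lookup_price`, its result fields, and the `check_tool` helper are hypothetical names made up for illustration; they are not part of any framework.

```python
import time

def lookup_price(sku: str) -> dict:
    """Hypothetical tool under test: returns a price record for a SKU."""
    catalog = {"A1": 9.99, "B2": 24.50}
    if sku not in catalog:
        # Unknown input yields a structured error instead of an exception.
        return {"ok": False, "error": "unknown_sku"}
    return {"ok": True, "sku": sku, "price": catalog[sku]}

def check_tool(tool, good_input, bad_input, max_seconds=0.5):
    """Call the tool directly (no agent reasoning) and score four checks."""
    start = time.perf_counter()
    first = tool(good_input)
    elapsed = time.perf_counter() - start
    second = tool(good_input)   # repeat the call to check it is safe to retry
    failure = tool(bad_input)   # a fixed bad input to check error handling
    return {
        "format": first.get("ok") is True and isinstance(first.get("price"), float),
        "error_handling": failure.get("ok") is False and "error" in failure,
        "idempotent": first == second,
        "latency": elapsed <= max_seconds,
    }

tool_scorecard = check_tool(lookup_price, good_input="A1", bad_input="ZZ")
print(tool_scorecard)
```

Because the inputs are fixed and no model is in the loop, any failing entry in `tool_scorecard` points at the tool itself.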
Evaluate planning in isolation
Score the planner on plan quality: completeness, the right order, and correct dependencies. Stub out tools and memory while you do it. A bad plan still gives bad runs even with good tools.
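One way to score a plan without running any tools is sketched below. It assumes the planner's output can be reduced to an ordered list of step names, and that the eval set supplies the required steps and a prerequisite map. The function `score_plan` and the flight-booking step names are illustrative assumptions.

```python
def score_plan(plan, required_steps, depends_on):
    """Score a plan for completeness and for order/dependency violations.

    plan:           ordered list of step names produced by the planner
    required_steps: steps a correct plan must contain
    depends_on:     maps a step to the steps that must come before it
    """
    position = {step: i for i, step in enumerate(plan)}
    present = [s for s in required_steps if s in position]
    completeness = len(present) / len(required_steps)

    violations = []
    for step, prereqs in depends_on.items():
        if step not in position:
            continue  # a missing step already counts against completeness
        for prereq in prereqs:
            # A prerequisite that is missing or scheduled later is a violation.
            if prereq not in position or position[prereq] > position[step]:
                violations.append((prereq, step))
    return {"completeness": completeness, "dependency_violations": violations}

required = ["search_flights", "check_budget", "book_flight"]
depends_on = {"book_flight": ["search_flights", "check_budget"]}

good = score_plan(["search_flights", "check_budget", "book_flight"], required, depends_on)
bad = score_plan(["book_flight", "search_flights"], required, depends_on)
print(good)
print(bad)
```

The second plan is incomplete and books before searching, so it is flagged even though no tool was ever called.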
Evaluate memory in isolation
Test that memory writes, reads, keeps things, and pulls back the right items. Confirm the agent stores what it should and recalls it for the right queries.
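The write-then-recall check can be sketched as a small recall measurement. The `KeywordMemory` class below is a toy stand-in, assumed here only so the example runs on its own; in practice you would pass in the agent's real memory store behind the same hypothetical `write` and `retrieve` methods.

```python
class KeywordMemory:
    """Toy memory store (an assumption for this sketch): ranks by word overlap."""

    def __init__(self):
        self.items = []

    def write(self, text):
        self.items.append(text)

    def retrieve(self, query, k=1):
        query_words = set(query.lower().split())
        ranked = sorted(
            self.items,
            key=lambda item: len(query_words & set(item.lower().split())),
            reverse=True,
        )
        return ranked[:k]

def memory_recall(memory, facts, probes, k=1):
    """Write all facts, then return the fraction of probes whose expected
    fact comes back in the top-k retrieved items."""
    for fact in facts:
        memory.write(fact)
    hits = sum(1 for query, expected in probes if expected in memory.retrieve(query, k))
    return hits / len(probes)

facts = [
    "user prefers aisle seats",
    "user home airport is LIS",
    "user is vegetarian",
]
probes = [
    ("which seats does the user prefer", "user prefers aisle seats"),
    ("what is the home airport", "user home airport is LIS"),
]
recall = memory_recall(KeywordMemory(), facts, probes)
print(recall)
```

Each probe pairs a query with the item that should come back, so a low score separates a retrieval problem from a reasoning problem.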
Evaluate learning (if present) in isolation
If the agent learns from feedback, run the learning loop with set traces. Confirm it improves on the cases you meant to fix and does not get worse on the others.
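A before/after comparison along these lines can be sketched as follows. It assumes you have per-case pass/fail results from the same eval set run before and after the learning update; the case names and the `learning_report` helper are hypothetical.

```python
def learning_report(before, after, target_cases):
    """Compare per-case pass/fail results from before and after a learning update.

    before, after: dicts mapping case name -> bool (passed)
    target_cases:  cases the update was meant to fix
    """
    targets = set(target_cases)
    fixed = sorted(c for c in targets if not before[c] and after[c])
    still_failing = sorted(c for c in targets if not after[c])
    # A regression is a non-target case that passed before and fails now.
    regressions = sorted(
        c for c in before if c not in targets and before[c] and not after[c]
    )
    return {"fixed": fixed, "still_failing": still_failing, "regressions": regressions}

before = {"refund_flow": False, "date_parsing": False, "greeting": True, "address_lookup": True}
after = {"refund_flow": True, "date_parsing": False, "greeting": True, "address_lookup": False}

report = learning_report(before, after, target_cases=["refund_flow", "date_parsing"])
print(report)
```

An update is only a clear win when `fixed` is non-empty and `regressions` is empty; here one target improved while an unrelated case got worse.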
Run end-to-end holistic scenarios
Run realistic multi-step tasks that use tools, planning, and memory together. Score three things: consistency, meaning the same input gives a similar output; coherence, meaning the steps within a run hang together logically; and the rate of made-up content (hallucination).
Uses: Eval Harness
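Two of the holistic scores can be approximated with simple functions, sketched below under stated assumptions. Consistency is taken as the mean pairwise text similarity across repeated runs of the same input, using the standard library's `difflib.SequenceMatcher`. The made-up content rate assumes the agent's claims have already been extracted into strings and can be checked against a set of supported facts. Coherence usually needs a human or model judge and is left out. The function names and sample data are illustrative.

```python
from difflib import SequenceMatcher
from itertools import combinations

def consistency(outputs):
    """Mean pairwise similarity (0..1) of outputs from repeated runs of one input."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def made_up_rate(claims, supported_facts):
    """Fraction of extracted claims that are not in the supported facts."""
    if not claims:
        return 0.0
    unsupported = [c for c in claims if c not in supported_facts]
    return len(unsupported) / len(claims)

# Three runs of the same scenario (hypothetical outputs).
runs = [
    "Booked flight TP123 to LIS, seat 12C.",
    "Booked flight TP123 to LIS, seat 12C.",
    "Booked flight TP123 to LIS, seat 12C.",
]
claims = ["flight departs 09:40", "seat 12C", "free lounge access"]
supported = {"flight departs 09:40", "seat 12C"}

print(consistency(runs))
print(made_up_rate(claims, supported))
```

Exact string matching is a deliberately crude choice; a real harness would more likely compare structured outcomes or use a judge, but the shape of the score stays the same.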
Diff component vs holistic results
Compare the two layers. An ability that passes on its own but fails in the full run points to an integration bug. An ability that fails on its own but passes in the full run usually means the end-to-end scenarios are too easy or never exercise that ability. Both gaps tell you what to fix next.
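The comparison can be sketched as a small lookup over the two scorecards. This assumes each layer has been reduced to one pass/fail value per ability, which for the holistic layer means failures have already been attributed to an ability, for example from run traces. The labels `component_bug` and `ok` for the two non-gap cases are additions for completeness, and `diff_layers` is a hypothetical helper.

```python
def diff_layers(component_pass, holistic_pass):
    """Label each ability by comparing its isolated result with its end-to-end result."""
    findings = {}
    for ability, passed_alone in component_pass.items():
        passed_in_full_run = holistic_pass[ability]
        if passed_alone and not passed_in_full_run:
            findings[ability] = "integration_bug"
        elif not passed_alone and passed_in_full_run:
            findings[ability] = "holistic_too_easy"
        elif not passed_alone:
            findings[ability] = "component_bug"
        else:
            findings[ability] = "ok"
    return findings

component = {"tools": True, "planning": True, "memory": False, "learning": True}
holistic = {"tools": True, "planning": False, "memory": True, "learning": True}

findings = diff_layers(component, holistic)
print(findings)
```

In this sample, planning is the integration suspect, while the memory result suggests the end-to-end scenarios need harder memory cases.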
Principles
- Per-ability tests pin down where a bug is. Whole-system tests catch the ones that only appear together. You need both.
- Isolate an ability by faking its neighbours, not by hoping the full-run test happens to exercise it.
- Consistency, coherence, and made-up content are traits of the whole system, not of any one ability.
- A whole-system-only test is quick to build but slow to debug. A per-ability-only test is the opposite.
Known failure modes (2)
Related patterns (5)
- ★★Eval Harness
Run a held-out dataset against agent versions to detect regressions and measure improvement.
- ★★Plan-and-Execute
Plan all the steps once with a strong model, then execute each step with a cheaper model under the plan.
- ★★Tool Use
Let the LLM produce typed calls against an external toolkit instead of producing free-form text the surrounding system has to parse.
- ★Agentic Memory
Expose memory management as first-class tool actions (ADD, UPDATE, DELETE, RETRIEVE, SUMMARY, FILTER) the LLM chooses at every step, trained end-to-end so short-term and long-term memory live under one learned policy.
- ★Dry-Run Harness
Simulate planned actions (and their projected side effects) without committing them, surfacing a reviewable diff before any commit.
Related compositions (1)
Related methodologies (2)
- ★★Evaluation-Driven Development
Judge every prompt change, model swap, search tweak, and new tool against a test you committed to up front, not by feel.
- ★Evaluation Planning Framework
Produce a runnable test harness for a multi-agent system whose checks, scoring methods, and step anchors are all chosen on purpose before you build it.
Sources (2)
Building Applications with AI Agents
Ch 9 'Validation and Measurement' “Component Evaluation ... Evaluating Tools ... Evaluating Planning ... Evaluating Memory ... Evaluating Learning ... Holistic Evaluation ... Performance in End-to-End Scenarios ... Consistency ... Coherence ... Hallucination”
Building Applications with AI Agents — O'Reilly catalogue (Ch 9 TOC mirror)
Ch 9 'Validation and Measurement' “Measuring Agentic Systems ... Component Evaluation ... Holistic Evaluation”
Provenance
- Added to catalog:
- Last updated:
- Verification status: verified