Methodology · Evaluation · emerging · verified

Component Then Holistic Evaluation

also known as bottom-up agent eval, capability-then-system eval

Applies to: agent, multi-agent-system, coding-agent, browser-agent

Tags: component-eval, holistic-eval, bottom-up, integration

Test an agent from the bottom up. First score each ability on its own: tools, planning, memory, and learning. Then run full end-to-end scenarios to check the whole system. At that level you look for consistency, coherence, and made-up content. The per-ability tests catch problems close to where they start. The whole-system pass catches new problems that appear only when abilities that each pass on their own interact badly. Skipping either layer is a known mistake. Per-ability tests alone miss integration bugs. Whole-system tests alone cannot tell you which ability caused a bug.

Methodology process overview

flowchart TD
    agent[Agent under test] --> s1[Decompose into capabilities]
    s1 --> tools_c[Tool selection / invocation]
    s1 --> plan_c[Planning]
    s1 --> mem_c[Memory]
    s1 --> learn_c[Learning]
    tools_c --> s2[Evaluate tools - mock reasoning]
    plan_c --> s3[Evaluate planning - stub tools and memory]
    mem_c --> s4[Evaluate memory - controlled traces]
    learn_c --> s5[Evaluate learning loop]
    s2 --> comp[Per-capability scorecards]
    s3 --> comp
    s4 --> comp
    s5 --> comp
    scen[End-to-end scenarios] --> s6[Run holistic scenarios]
    agent --> s6
    s6 --> hol[Holistic scorecard - consistency, coherence, hallucination]
    comp --> s7[Diff component vs holistic]
    hol --> s7
    s7 --> int{Component pass + holistic fail?}
    int -->|yes| bug[Integration bug]
    int -->|no| s7b{Component fail + holistic pass?}
    s7b -->|yes| weak[Holistic eval too easy]
    s7b -->|no| ok[Both layers green]

Intent. Test an agent at two layers, per ability and end-to-end, so you catch bugs where they start and still surface the ones that only appear when abilities interact.

When to apply. Use this for any agent with several distinct abilities, such as tool use, planning, memory, and possibly online learning. It pays off when debugging full runs is expensive and you can exercise each ability on its own. Don't apply it to single-ability agents, such as a pure summariser or a pure classifier, where there is nothing to break apart. One whole-system test is enough there.

Example scenario

A travel-booking agent has four named abilities. It has flight and hotel search tools, eight functions in total. It has a planner that orders the search-compare-book steps. It has a memory store that holds the user's seat and meal preferences from past trips. And it has a learning loop that adjusts preference weights from user thumbs-up or thumbs-down on suggestions.

The team had been running only end-to-end scenarios, such as book a trip from A to B. They were drowning in 'the agent picked a weird hotel' tickets they could not pin on any one part. So they added a per-ability layer.

For tools on their own, each search tool was hit with 40 input cases, including bad city codes, empty result sets, and rate-limit responses. Two tools were unsafe to retry and got fixed. For planning on its own, the planner ran on 25 scripted user requests with tool calls and memory reads stubbed to canned responses. The plan-quality judge found that on multi-leg trips the planner skipped the layover-feasibility check 30% of the time. For memory on its own, 50 read, write, and recall scenarios confirmed the store handled preferences correctly. But its ranking for 'similar past trip' was poor on city pairs the user had visited only once.

The whole-system test then ran 80 end-to-end scenarios. It scored consistency, meaning the same user and the same request run twice, at 87% agreement. It scored coherence, meaning the planner-to-tool-to-output flow, graded by a judge. And it scored made-up content, meaning claims about hotel amenities not in the source, at 4.2%.

The key finding: two scenarios passed every per-ability test but failed in the full run. The planner correctly called the memory tool, and the memory tool returned correct preferences. But the planner worded the memory query in a way that triggered the weak ranking. That integration bug was invisible to either layer on its own. The gap drove a one-week fix that raised whole-system consistency to 94%.
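Two of the whole-system scores in this scenario can be sketched in a few lines. This is a minimal illustration, not the team's scoring code: it assumes agreement means an exact match between the final answers of repeated runs, and that a claim counts as made up when it has no exact match in a set of source facts. The function names and sample data are hypothetical.

```python
from itertools import combinations


def consistency_rate(runs_by_scenario: dict[str, list[str]]) -> float:
    """Fraction of same-scenario run pairs whose final answers match exactly."""
    agree = total = 0
    for outputs in runs_by_scenario.values():
        for a, b in combinations(outputs, 2):
            total += 1
            agree += (a == b)
    return agree / total if total else 0.0


def made_up_rate(claims: list[str], source_facts: set[str]) -> float:
    """Fraction of output claims with no matching entry in the source facts."""
    if not claims:
        return 0.0
    unsupported = [c for c in claims if c not in source_facts]
    return len(unsupported) / len(claims)


# Hypothetical data: each scenario is run twice with the same user and request.
runs = {
    "A-to-B, user 1": ["Hotel X", "Hotel X"],
    "A-to-C, user 2": ["Hotel Y", "Hotel Z"],
}
print(consistency_rate(runs))  # one of two pairs agrees

claims = ["pool", "free wifi", "rooftop bar"]
print(made_up_rate(claims, {"pool", "free wifi"}))  # one of three claims is unsupported
```

Exact matching is the strictest possible notion of agreement. A real harness would more likely compare structured fields or use a judge, since free-text answers rarely repeat word for word.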

Inputs

Capability inventory — A list of the agent's distinct abilities: its tools, its planning module, its memory store, and its learning loop.
Per-capability eval sets — Inputs built to exercise one ability on its own, with the other abilities faked out by mocks or stubs.
End-to-end scenarios — Realistic multi-step tasks that make several abilities work together.
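A per-capability eval set for the planner might look like the sketch below. The planner interface, the step names, and the stub functions are all assumptions made for illustration; the toy planner deliberately omits the layover check so that the harness has something to catch.

```python
from typing import Callable


def stub_search(query: str) -> list[dict]:
    """Canned tool response, so tool behaviour cannot affect the result."""
    return [{"flight": "F1", "layover_min": 45}]


def stub_recall(key: str) -> dict:
    """Canned memory read, so memory behaviour cannot affect the result."""
    return {"seat": "aisle", "meal": "vegetarian"}


def toy_planner(request: str,
                search: Callable[[str], list[dict]],
                recall: Callable[[str], dict]) -> list[str]:
    """Hypothetical planner under test; it never plans a layover check."""
    recall("preferences")
    search(request)
    return ["recall_preferences", "search", "compare", "book"]


def plan_passes(plan: list[str], required: list[str]) -> bool:
    """True if every required step appears, in the required relative order."""
    positions = [plan.index(step) for step in required if step in plan]
    return len(positions) == len(required) and positions == sorted(positions)


# Eval set: request -> steps the plan must contain, in order.
CASES = {
    "one-way A to B": ["search", "compare", "book"],
    "multi-leg A to B to C": ["search", "check_layover_feasibility", "compare", "book"],
}

results = {
    request: plan_passes(toy_planner(request, stub_search, stub_recall), required)
    for request, required in CASES.items()
}
print(results)
```

Because both neighbours return canned data, the multi-leg failure can only come from the planner.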

Outputs

Per-capability scorecards — Pass/fail or number scores for tools, planning, memory, and learning, each tested on its own.
Holistic scorecard — End-to-end scores for consistency, coherence, and made-up content across realistic scenarios.
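One possible shape for a scorecard, sketched under the assumption that each eval case reduces to a pass or fail. The `Scorecard` class and its fields are hypothetical; a numeric-score variant would store floats instead of booleans.

```python
from dataclasses import dataclass, field


@dataclass
class Scorecard:
    """Pass/fail results for one capability, or for the holistic layer."""
    name: str
    results: dict[str, bool] = field(default_factory=dict)

    def record(self, case_id: str, passed: bool) -> None:
        self.results[case_id] = passed

    @property
    def pass_rate(self) -> float:
        """Share of recorded cases that passed; 0.0 if nothing is recorded."""
        if not self.results:
            return 0.0
        return sum(self.results.values()) / len(self.results)

    def failures(self) -> list[str]:
        """Case ids that failed, in recording order."""
        return [case for case, ok in self.results.items() if not ok]


planning = Scorecard("planning")
planning.record("one-way", True)
planning.record("multi-leg", False)
print(planning.name, planning.pass_rate, planning.failures())
```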

Steps (7)

Decompose the agent into testable capabilities
Split the agent into parts you can test on their own: picking and calling tools, planning quality, reading and writing memory correctly, and any learning loop.
Evaluate tools in isolation
Test each tool on its own with fixed inputs. Check that the output format is right, errors are handled, retries are safe, and latency is acceptable. Fake the agent's reasoning so any failure points straight at the tool.
Uses: Tool Use, Dry-Run Harness
Evaluate planning in isolation
Score the planner on plan quality: completeness, the right order, and correct dependencies. Stub out tools and memory while you do it. A bad plan still gives bad runs even with good tools.
Uses: Plan-and-Execute, Planner-Executor-Verifier (PEV)
Evaluate memory in isolation
Test that memory writes, reads, retains, and retrieves the right items. Confirm the agent stores what it should and recalls it for the right queries.
Uses: Agentic Memory, Short-Term Thread Memory
Evaluate learning (if present) in isolation
If the agent learns from feedback, run the learning loop with set traces. Confirm it improves on the cases you meant to fix and does not get worse on the others.
Run end-to-end holistic scenarios
Run realistic multi-step tasks that use tools, planning, and memory together. Score three things: consistency, meaning the same input gives a similar output; coherence, meaning the steps within a run hang together logically; and the rate of made-up content.
Uses: Eval Harness
Diff component vs holistic results
Compare the two layers. An ability that passes on its own but fails in the full run points to an integration bug. An ability that fails on its own but passes in the full run usually means the whole-system test is too easy. Both gaps tell you what to fix next.
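The final diff step can be sketched as a small classifier. It assumes each holistic failure has already been attributed to an ability, for example by reading the run trace, so that both layers are keyed by the same ability names. The `Finding` labels and sample results are hypothetical.

```python
from enum import Enum


class Finding(Enum):
    INTEGRATION_BUG = "passes alone, fails in full run"
    HOLISTIC_TOO_EASY = "fails alone, passes in full run"
    COMPONENT_BUG = "fails in both layers"
    OK = "passes in both layers"


def classify(component_ok: bool, holistic_ok: bool) -> Finding:
    """Map one ability's results in the two layers to a finding."""
    if component_ok and not holistic_ok:
        return Finding.INTEGRATION_BUG
    if not component_ok and holistic_ok:
        return Finding.HOLISTIC_TOO_EASY
    return Finding.OK if component_ok else Finding.COMPONENT_BUG


def diff_layers(component: dict[str, bool],
                holistic: dict[str, bool]) -> dict[str, Finding]:
    """Compare per-ability results; both dicts must share the same keys."""
    return {cap: classify(ok, holistic[cap]) for cap, ok in component.items()}


# Hypothetical results, keyed by ability.
component = {"tools": True, "planning": False, "memory": True}
holistic = {"tools": True, "planning": True, "memory": False}

for cap, finding in diff_layers(component, holistic).items():
    print(f"{cap}: {finding.value}")
```

The sketch separates the case where both layers fail from the case where both pass, since a failure in both layers is an ordinary component bug that the per-ability test has already located.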

Principles

Per-ability tests pin down where a bug is. Whole-system tests catch the ones that only appear together. You need both.
Isolate an ability by faking its neighbours, not by hoping the full-run test happens to exercise it.
Consistency, coherence, and made-up content are traits of the whole system, not of any one ability.
A whole-system-only test is quick to build but slow to debug. A per-ability-only test is the opposite.
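The isolation principle can be illustrated with a tool-level retry check. Here the agent's reasoning is replaced by a scripted call, so any failure points at the tool itself. `FakeBookingTool` is a hypothetical stand-in with an in-memory side effect, not a real booking API.

```python
class FakeBookingTool:
    """Hypothetical tool with a side effect (a charge) for each new booking."""

    def __init__(self, idempotent: bool):
        self.idempotent = idempotent
        self.bookings: dict[str, str] = {}  # request_id -> hotel
        self.charges = 0

    def book(self, request_id: str, hotel: str) -> str:
        # An idempotent tool returns the earlier result instead of charging again.
        if self.idempotent and request_id in self.bookings:
            return self.bookings[request_id]
        self.bookings[request_id] = hotel
        self.charges += 1
        return hotel


def is_retry_safe(tool: FakeBookingTool, request_id: str, hotel: str) -> bool:
    """Issue the same scripted call twice; a retry must not change state."""
    first = tool.book(request_id, hotel)
    charges_after_first = tool.charges
    second = tool.book(request_id, hotel)
    return first == second and tool.charges == charges_after_first


print(is_retry_safe(FakeBookingTool(idempotent=True), "req-1", "Hotel X"))
print(is_retry_safe(FakeBookingTool(idempotent=False), "req-1", "Hotel X"))
```

Comparing return values alone would miss the unsafe tool, because both variants return the same hotel. The check has to inspect the side effect as well.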

Related compositions (1)

recipe · abstract shape
Eval & Observability
How you keep an agent honest in production: harness, judge, decision log, provenance, shadow rollouts.

Provenance

Added to catalog: 2026-05-24
Last updated: 2026-05-27
Verification status: verified
