Methodology · Evaluation · emerging · verified

Component Then Holistic Evaluation

Also known as: bottom-up agent eval, capability-then-system eval

Applies to: agent, multi-agent-system, coding-agent, browser-agent

Tags: component-eval, holistic-eval, bottom-up, integration

Test an agent from the bottom up. First score each ability on its own: tools, planning, memory, and learning. Then run full end-to-end scenarios to check the whole system. At that level you look for consistency, coherence, and made-up content. The per-ability tests catch problems close to where they start. The whole-system pass catches new problems that appear only when abilities that each pass on their own interact badly. Skipping either layer is a known mistake. Per-ability tests alone miss integration bugs. Whole-system tests alone cannot tell you which ability caused a bug.

Intent. Test an agent at two layers, per ability and end-to-end, so you catch bugs where they start and still surface the ones that only appear when abilities interact.

When to apply. Use this for any agent with several distinct abilities, such as tool use, planning, memory, and possibly online learning. It pays off when debugging full runs is expensive and you can exercise each ability on its own. Do not apply it to single-ability agents, such as a pure summariser or a pure classifier, where there is nothing to break apart. One whole-system test is enough there.

Inputs

  • Capability inventory: A list of the agent's distinct abilities: its tools, its planning module, its memory store, and its learning loop.
  • Per-capability eval sets: Inputs built to exercise one ability on its own, with the other abilities faked out by mocks or stubs.
  • End-to-end scenarios: Realistic multi-step tasks that make several abilities work together.

Outputs

  • Per-capability scorecards: Pass/fail or numeric scores for tools, planning, memory, and learning, each tested on its own.
  • Holistic scorecard: End-to-end scores for consistency, coherence, and made-up content across realistic scenarios.

Steps (7)

  1. Decompose the agent into testable capabilities

    Split the agent into parts you can test on their own: picking and calling tools, planning quality, reading and writing memory correctly, and any learning loop.
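
The decomposition can be written down as a small inventory before any tests exist. The sketch below is one possible shape, not a prescribed format: the `Capability` class, `build_inventory`, and `untested` are hypothetical names introduced here for illustration. Each entry records which neighbours must be stubbed when that ability is under test.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Capability:
    """One testable ability (hypothetical structure, for illustration only)."""
    name: str
    stubs: list[str] = field(default_factory=list)       # neighbours to fake in isolated tests
    eval_cases: list[str] = field(default_factory=list)  # ids of isolated eval cases

def build_inventory() -> dict[str, Capability]:
    names = ["tools", "planning", "memory", "learning"]
    inventory = {}
    for name in names:
        # When one ability is under test, every other ability is stubbed.
        others = [n for n in names if n != name]
        inventory[name] = Capability(name=name, stubs=others)
    return inventory

def untested(inventory: dict[str, Capability]) -> list[str]:
    """Abilities that have no isolated eval cases yet."""
    return [c.name for c in inventory.values() if not c.eval_cases]

inventory = build_inventory()
inventory["tools"].eval_cases.append("tool-format-001")
```

Calling `untested(inventory)` after each round of test writing gives a quick list of abilities that still lack an isolated eval set.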

  2. Evaluate tools in isolation

    Test each tool on its own with fixed inputs. Check output format, error handling, idempotency (repeated calls are safe), and latency. Fake the agent's reasoning so any failure points straight at the tool.

    uses: Tool Use, Dry-Run Harness
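
A minimal sketch of the four tool checks, assuming a tool is a plain Python callable that returns a dict with an `ok` flag. The `lookup_price` tool, its catalog, and the `check_tool` helper are hypothetical and exist only to make the example runnable. No agent or model is involved, so any failure belongs to the tool.

```python
import time

def lookup_price(sku):
    """Hypothetical tool under test: returns a price record for a SKU."""
    catalog = {"A1": 9.99, "B2": 24.50}
    if sku not in catalog:
        # Structured error instead of an exception, so a caller can react to it.
        return {"ok": False, "error": f"unknown sku: {sku}"}
    return {"ok": True, "sku": sku, "price": catalog[sku]}

def check_tool(tool, good_input, bad_input, required_keys, max_seconds=0.5):
    """Run the isolated checks and return one pass/fail flag per check."""
    start = time.perf_counter()
    first = tool(good_input)
    elapsed = time.perf_counter() - start

    try:
        bad = tool(bad_input)
        handled = isinstance(bad, dict) and bad.get("ok") is False and "error" in bad
    except Exception:
        handled = False  # an unhandled exception counts as a failed check

    return {
        "format": isinstance(first, dict) and set(required_keys) <= set(first),
        "error_handling": handled,
        "idempotent": tool(good_input) == first,  # same input, same result
        "latency": elapsed <= max_seconds,
    }

report = check_tool(lookup_price, "A1", "ZZ", {"ok", "sku", "price"})
```

The latency budget of 0.5 seconds is an arbitrary placeholder. A real tool that calls a network service would need a budget taken from its own service-level target.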

  3. Evaluate planning in isolation

    Score the planner on plan quality: completeness, the right order, and correct dependencies. Stub out tools and memory while you do it. A bad plan still gives bad runs even with good tools.

    uses: Plan-and-Execute, Planner-Executor-Verifier (PEV)
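
The three plan-quality checks can be scored without running any tool. The sketch below assumes a plan is a list of steps, each a dict with an `id` and an optional `depends_on` list. That plan shape and the `score_plan` function are assumptions made for this example, not a format any framework requires.

```python
def score_plan(plan, required_steps):
    """Score a plan on completeness, dependency validity, and ordering."""
    ids = [step["id"] for step in plan]
    position = {step_id: i for i, step_id in enumerate(ids)}

    # Completeness: share of required steps that appear in the plan.
    completeness = len(set(ids) & set(required_steps)) / len(required_steps)

    # Collect (step, dependency) pairs, then split off the resolvable ones.
    deps = [(s["id"], d) for s in plan for d in s.get("depends_on", [])]
    known = [(s, d) for s, d in deps if d in position]

    return {
        "completeness": completeness,
        # Every dependency must name a step that exists in the plan.
        "dependencies_valid": len(known) == len(deps),
        # Every resolvable dependency must come before the step that needs it.
        "order_valid": all(position[d] < position[s] for s, d in known),
    }

good_plan = [
    {"id": "fetch"},
    {"id": "parse", "depends_on": ["fetch"]},
    {"id": "report", "depends_on": ["parse"]},
]
bad_plan = [
    {"id": "parse", "depends_on": ["fetch"]},  # runs before its dependency
    {"id": "fetch", "depends_on": ["auth"]},   # depends on a step that is missing
]
required = ["fetch", "parse", "report"]
good_score = score_plan(good_plan, required)
bad_score = score_plan(bad_plan, required)
```

Here `good_plan` passes all three checks, while `bad_plan` covers two of the three required steps and fails both the dependency and the ordering check.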

  4. Evaluate memory in isolation

    Test memory writes, reads, retention, and retrieval accuracy. Confirm the agent stores what it should and recalls it for the right queries.

    uses: Agentic Memory, Short-Term Thread Memory
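
A minimal sketch of an isolated memory check. `KeywordMemory` is a hypothetical stand-in that ranks items by word overlap. A real agent would plug its own store in behind the same `write`, `read`, and `search` methods, which are assumed names. The check covers write/read round trips, retention after unrelated writes, and whether the right item comes back for each query.

```python
class KeywordMemory:
    """Hypothetical in-memory store standing in for the agent's real memory."""

    def __init__(self):
        self._items = {}

    def write(self, key, text):
        self._items[key] = text

    def read(self, key):
        return self._items.get(key)

    def search(self, query, k=1):
        """Return up to k keys ranked by word overlap with the query."""
        words = set(query.lower().split())
        scored = []
        for key, text in self._items.items():
            overlap = len(words & set(text.lower().split()))
            if overlap:
                scored.append((-overlap, key))  # negative so sorting puts best first
        return [key for _, key in sorted(scored)[:k]]

def check_memory(memory, facts, queries, filler=100, k=1):
    for key, text in facts.items():
        memory.write(key, text)
    roundtrip = all(memory.read(key) == text for key, text in facts.items())

    # Retention: earlier facts must survive many unrelated writes.
    for i in range(filler):
        memory.write(f"filler-{i}", f"unrelated note number {i}")
    retained = all(memory.read(key) == text for key, text in facts.items())

    # Retrieval: the expected key must appear in the top-k results.
    hits = sum(1 for q, expected in queries.items() if expected in memory.search(q, k))
    return {"roundtrip": roundtrip, "retained": retained, "recall_at_k": hits / len(queries)}

facts = {
    "user_name": "the user is called Dana",
    "deadline": "the report deadline is Friday",
}
queries = {"what is the deadline": "deadline", "what is the user called": "user_name"}
memory_report = check_memory(KeywordMemory(), facts, queries)
```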

  5. Evaluate learning (if present) in isolation

    If the agent learns from feedback, run the learning loop with set traces. Confirm it improves on the cases you meant to fix and does not get worse on the others.
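
One way to express the accept or reject decision for a learning update is to compare pass/fail results before and after the update on the same fixed cases. The sketch assumes each result set maps a case id to a boolean. The `regression_report` function and the case ids are hypothetical.

```python
def regression_report(before, after, target_ids):
    """Compare pass/fail maps from before and after a learning update."""
    targets = set(target_ids)
    # Target cases that failed before and pass now.
    fixed = [c for c in target_ids if not before[c] and after[c]]
    # Target cases the update was meant to fix but did not.
    still_failing = [c for c in target_ids if not after[c]]
    # Non-target cases that passed before and fail now.
    regressed = [c for c in before if c not in targets and before[c] and not after[c]]
    return {
        "fixed": fixed,
        "still_failing": still_failing,
        "regressed": regressed,
        "accept": not still_failing and not regressed,
    }

before = {"t1": False, "t2": False, "h1": True, "h2": True}
after = {"t1": True, "t2": False, "h1": True, "h2": False}
learning_report = regression_report(before, after, target_ids=["t1", "t2"])
```

In this example the update fixes `t1`, leaves `t2` failing, and breaks the held-out case `h2`, so it is rejected.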

  6. Run end-to-end holistic scenarios

    Run realistic multi-step tasks that use tools, planning, and memory together. Score three things: consistency, meaning the same input gives a similar output; coherence, meaning the steps within a run hang together logically; and the rate of made-up content.

    uses: Eval Harness
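
The three holistic scores can be approximated with simple proxies. The sketch below is a rough illustration under strong assumptions: consistency is mean pairwise string similarity across repeated runs, coherence is the share of steps whose inputs were produced by an earlier step, and made-up content is the share of output claims with no exact match in the evidence the run actually gathered. All function names and the step and claim shapes are hypothetical. A production harness would likely use semantic similarity and a claim verifier instead of exact string matching.

```python
from difflib import SequenceMatcher
from itertools import combinations

def consistency(outputs):
    """Mean pairwise similarity of outputs from repeated runs on one input."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def coherence(steps):
    """Share of steps whose needed inputs were produced by an earlier step."""
    if not steps:
        return 1.0
    available, ok = set(), 0
    for step in steps:
        if set(step["needs"]) <= available:
            ok += 1
        available |= set(step["produces"])
    return ok / len(steps)

def fabrication_rate(claims, evidence):
    """Share of output claims that have no exact match in the gathered evidence."""
    if not claims:
        return 0.0
    return sum(1 for c in claims if c not in evidence) / len(claims)

run_steps = [
    {"needs": [], "produces": ["raw_page"]},
    {"needs": ["raw_page"], "produces": ["price"]},
    {"needs": ["stock_level"], "produces": ["summary"]},  # input never produced
]
scores = {
    "consistency": consistency(["total is 9.99", "total is 9.99"]),
    "coherence": coherence(run_steps),
    "fabrication_rate": fabrication_rate(["price=9.99", "stock=4"], {"price=9.99"}),
}
```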

  7. Diff component vs holistic results

    Compare the two layers. An ability that passes on its own but fails in the full run points to an integration bug. An ability that fails on its own but passes in the full run usually means the whole-system test is too easy. Both gaps tell you what to fix next.
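
The comparison can be sketched as a lookup over the two scorecards. This assumes end-to-end failures have already been attributed to an ability, for example by trace inspection, so that both scorecards map an ability name to pass or fail. The `diff_layers` function and its labels are hypothetical.

```python
def diff_layers(component, holistic):
    """Label each ability by comparing its isolated and end-to-end results."""
    findings = {}
    for ability, passed_alone in component.items():
        passed_e2e = holistic.get(ability)
        if passed_e2e is None:
            findings[ability] = "not exercised end-to-end"
        elif passed_alone and not passed_e2e:
            findings[ability] = "integration bug suspected"
        elif not passed_alone and passed_e2e:
            findings[ability] = "holistic test likely too easy"
        elif passed_alone and passed_e2e:
            findings[ability] = "ok"
        else:
            findings[ability] = "component bug"
    return findings

component = {"tools": True, "planning": True, "memory": False, "learning": False}
holistic = {"tools": True, "planning": False, "memory": True}
findings = diff_layers(component, holistic)
```

With these inputs, planning is flagged as a suspected integration bug, memory points at a holistic suite that is too easy, and learning shows up as not exercised end-to-end at all.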

Principles

  • Per-ability tests pin down where a bug is. Whole-system tests catch the ones that only appear together. You need both.
  • Isolate an ability by faking its neighbours, not by hoping the full-run test happens to exercise it.
  • Consistency, coherence, and made-up content are traits of the whole system, not of any one ability.
  • A whole-system-only test is quick to build but slow to debug. A per-ability-only test is the opposite.

Provenance

  • Added to catalog:
  • Last updated:
  • Verification status: verified