Methodology · Evaluation · emerging · needs-verification

Evaluation Planning Framework

also known as trajectory-grounded MAS eval planning, MAS evaluation plan

Applies to: multi-agent-system, agent, coding-agent, browser-agent

Tags: mas, trajectory-eval, planning, harness

Plan how to test a multi-agent system from its whole run, not just its final answer. The whole run, step by step, is called the trajectory. Turn expert know-how into a list of things to check. For each one, decide how to score it: a plain number, a code check, or a model judge. Then assemble those choices into a harness you can run. The result is an evaluation plan. It names what you measure at each step of the run and how you measure it, before you write any code.

Methodology process overview

Intent. Produce a runnable test harness for a multi-agent system whose checks, scoring methods, and step anchors are all chosen on purpose before you build it.

When to apply. Use this when you are designing a test for a multi-agent system, or for any agent whose value lies in how it acts step by step, not just in its final answer. Examples: research agents, coding agents, browser agents, and planner-executor systems. Do not apply it to single-turn tasks such as classification or one-shot generation, where one output already shows the whole behaviour.

Inputs

  • Multi-agent system definition: The roles, the way the agents talk to each other, and the tools the system can use.
  • Domain experts: People who can say what a correct run and a correct final result look like in this field.
  • Sample trajectories: Recorded runs, both good and bad, that show what each check is scoring.
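
One way to make the sample trajectories concrete is to store each recorded run as a list of typed steps plus the expert's good/bad label. The sketch below is only an illustration under assumptions: the names `Step`, `Trajectory`, `kind`, and `label`, and the four step kinds, are hypothetical and not a schema this methodology prescribes.

```python
from dataclasses import dataclass, field
from typing import Any, Literal

# Step kinds mirror the anchor points named in the steps (assumed vocabulary).
StepKind = Literal["tool_call", "handoff", "plan_revision", "final_output"]


@dataclass
class Step:
    agent: str                       # which agent produced this step
    kind: StepKind                   # what sort of step it is
    payload: dict[str, Any] = field(default_factory=dict)


@dataclass
class Trajectory:
    run_id: str
    steps: list[Step]
    label: Literal["good", "bad"]    # the domain expert's judgement of the run

    def steps_of(self, kind: StepKind) -> list[Step]:
        """Return every step of one kind, in the order it occurred."""
        return [s for s in self.steps if s.kind == kind]


sample = Trajectory(
    run_id="run-001",
    label="good",
    steps=[
        Step("planner", "plan_revision", {"plan": ["search", "summarise"]}),
        Step("planner", "handoff", {"to": "researcher"}),
        Step("researcher", "tool_call", {"tool": "search", "latency_s": 0.8}),
        Step("researcher", "final_output", {"text": "Summary of findings."}),
    ],
)
print(len(sample.steps_of("tool_call")))
```

Keeping the label on the trajectory itself means the same records can later serve as the curated set for calibration.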

Outputs

  • Evaluation plan document: The named checks, the scoring method for each one, the step in the run where each is measured, and the pass marks.
  • Executable evaluation harness: Code that runs the plan against new runs and returns scores.
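
The evaluation plan document can be kept as plain data, which makes it easy to review and to version. The following is a minimal sketch, assuming a hypothetical `Criterion` record with one field per item in the plan (name, anchor step, scoring method, pass mark). The example checks and pass marks are invented for illustration, and `plan_fingerprint` is just one possible way to detect that a plan has changed.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    anchor: str                   # step where it is scored, e.g. "tool_call"
    method: str                   # "metric", "code_check", or "llm_judge"
    pass_mark: float
    higher_is_better: bool = True


# Illustrative plan; the checks and thresholds are made up for this sketch.
PLAN = (
    Criterion("tool_call_count", "tool_call", "metric", 5.0, higher_is_better=False),
    Criterion("handoff_has_target", "handoff", "code_check", 1.0),
    Criterion("answer_quality", "final_output", "llm_judge", 0.7),
)


def plan_fingerprint(plan) -> str:
    """Short hash of the plan's content; any edit to the plan changes it."""
    canonical = json.dumps([asdict(c) for c in plan], sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


print(plan_fingerprint(PLAN))
```

Storing the fingerprint next to every score makes it visible when two results were produced under different plans.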

Steps (6)

  1. Inventory what is worth measuring

    Talk to the experts and list the checks that tell a good run from a bad one. These include teamwork between agents, tool choice, final answer quality, cost, and speed. Do not squash everything down to one 'did the task succeed' score.

  2. Anchor criteria to trajectory points

    For each check, decide where it is scored. It might be on a middle step, such as a tool call, a handoff between agents, or a plan revision. Or it might be on the final output. Checks that look at middle steps need the system to record those steps.

  3. Choose a scoring method per criterion

    For each check, pick how to score it. Use a plain number for things like speed, cost, or how many tool calls were made. Use a code check for things like valid format or a pattern match. Use a model judge for open-ended quality. Most harnesses mix all three.

    uses: LLM-as-Judge, Sampled Prompt Trace Eval

  4. Assemble the harness

    Wire the chosen scorers into one pipeline you can run. It takes a run and returns the full set of scores. Version the plan and the harness together.

    uses: Eval Harness

  5. Calibrate against curated trajectories

    Run the harness on the sample runs. Check that good runs score high and bad runs score low. If they do not, treat that as a defect in the plan, not in the system under test. Fix the plan before you measure anything new.

  6. Treat the plan as a frozen artefact

    Once it is calibrated, version the plan. Any later change to the plan makes earlier scores incomparable with new ones and forces a fresh baseline, just as changing a rubric does.
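
The steps above can be sketched as a small pipeline that mixes the three scoring methods and returns the full set of scores for one run. This is a hedged illustration, not a reference implementation: a run is assumed to be a list of step dictionaries with a `kind` key, all function names are hypothetical, and `stub_judge` is a stand-in for a real model judge, which would call a model of your choice.

```python
from typing import Callable

# Assumed shape for this sketch: a run is a list of step dictionaries.
Run = list[dict]


def tool_call_count(run: Run) -> float:
    """Plain number: how many tool calls the run made."""
    return float(sum(1 for s in run if s["kind"] == "tool_call"))


def handoffs_valid(run: Run) -> float:
    """Code check: 1.0 if every handoff names a target agent, else 0.0.

    A run with no handoffs passes trivially.
    """
    handoffs = [s for s in run if s["kind"] == "handoff"]
    return 1.0 if all(s.get("to") for s in handoffs) else 0.0


def make_judge_scorer(judge: Callable[[str], float]) -> Callable[[Run], float]:
    """Model judge: wrap any callable that rates the final text in [0, 1]."""
    def score(run: Run) -> float:
        finals = [s for s in run if s["kind"] == "final_output"]
        return judge(finals[-1]["text"]) if finals else 0.0
    return score


def build_harness(scorers: dict[str, Callable[[Run], float]]):
    """Wire the scorers into one callable: run in, dict of scores out."""
    def evaluate(run: Run) -> dict[str, float]:
        return {name: scorer(run) for name, scorer in scorers.items()}
    return evaluate


def stub_judge(text: str) -> float:
    # Placeholder only: a real judge would send the text to a model.
    return 1.0 if len(text.split()) >= 3 else 0.0


evaluate = build_harness({
    "tool_call_count": tool_call_count,
    "handoffs_valid": handoffs_valid,
    "answer_quality": make_judge_scorer(stub_judge),
})

run = [
    {"kind": "handoff", "agent": "planner", "to": "researcher"},
    {"kind": "tool_call", "agent": "researcher", "tool": "search"},
    {"kind": "final_output", "agent": "researcher", "text": "Three sources agree."},
]
scores = evaluate(run)
print(scores)
```

Passing the judge in as a callable keeps the harness testable without a model call, and keeps each score separate instead of collapsing them into a single number.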

Principles

  • Plan the test before you build it. Building it shows you what is easy to measure, not what matters to measure.
  • For a multi-agent system, the unit you study is the whole run, not a single output.
  • Mix scoring methods on purpose. Numbers, code checks, and judges each have their own weak spots.
  • Check the harness on known-good and known-bad runs before you trust it.
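
The last principle, which is also the calibration step, can be expressed as a simple separation test: score the labelled runs and confirm that the good ones land clearly above the bad ones. The sketch below rests on assumptions: `calibrate`, `toy_evaluate`, the `handoff_validity` score, and the `min_gap` threshold of 0.2 are all hypothetical, and comparing means is only one possible separation measure.

```python
from statistics import mean


def calibrate(evaluate, labelled_runs, key, min_gap=0.2):
    """Compare mean scores of known-good and known-bad runs for one check."""
    good = [evaluate(run)[key] for run, label in labelled_runs if label == "good"]
    bad = [evaluate(run)[key] for run, label in labelled_runs if label == "bad"]
    if not good or not bad:
        raise ValueError("calibration needs at least one good and one bad run")
    good_mean, bad_mean = mean(good), mean(bad)
    return {
        "good_mean": good_mean,
        "bad_mean": bad_mean,
        "separates": good_mean - bad_mean >= min_gap,
    }


def toy_evaluate(run):
    """Toy harness: fraction of handoffs that name a target agent."""
    handoffs = [s for s in run if s["kind"] == "handoff"]
    ok = sum(1 for s in handoffs if s.get("to"))
    return {"handoff_validity": ok / len(handoffs) if handoffs else 1.0}


labelled = [
    ([{"kind": "handoff", "to": "researcher"}], "good"),
    ([{"kind": "handoff", "to": "writer"}, {"kind": "handoff", "to": "critic"}], "good"),
    ([{"kind": "handoff", "to": ""}], "bad"),
    ([{"kind": "handoff", "to": "writer"}, {"kind": "handoff"}], "bad"),
]

report = calibrate(toy_evaluate, labelled, "handoff_validity")
print(report)
```

If `separates` comes back false for a check, that check, its anchor, or its scoring method needs rework before the harness is used on new runs.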

Known failure modes (2)

Related patterns (4)

Related compositions (2)

Related methodologies (2)

Sources (2)

Provenance

  • Added to catalog:
  • Last updated:
  • Verification status: needs-verification