Methodology · Evaluation · emerging · needs-verification

Evaluation Planning Framework

Also known as: trajectory-grounded MAS eval planning, MAS evaluation plan

Applies to: multi-agent-system, agent, coding-agent, browser-agent

Tags: mas, trajectory-eval, planning, harness

Plan how to test a multi-agent system from its whole run, not just its final answer. The whole run, step by step, is called the trajectory. Turn expert know-how into a list of things to check. For each one, decide how to score it: a plain number, a code check, or a model judge. Then assemble those choices into a harness you can run. The result is an evaluation plan. It names what you measure at each step of the run and how you measure it, before you write any code.

Methodology process overview

flowchart TD
    mas[MAS definition] --> s1[Inventory what to measure]
    experts[Domain experts] --> s1
    s1 --> criteria[Named criteria]
    criteria --> s2[Anchor to trajectory points]
    traces[Sample trajectories] --> s2
    s2 --> anchored[Criteria with trajectory anchors]
    anchored --> s3[Choose scoring method per criterion]
    s3 --> numeric[Numeric: latency, cost]
    s3 --> prog[Programmatic: schema, regex]
    s3 --> judge[LLM judge: open-ended]
    numeric --> s4[Assemble harness]
    prog --> s4
    judge --> s4
    s4 --> harness[(Versioned eval plan + runnable harness)]
    harness --> s5[Calibrate on curated trajectories]
    s5 --> good{Good runs score high?}
    good -->|no| s1
    good -->|yes| s6[Freeze plan as artefact]

Intent. Produce a runnable test harness for a multi-agent system whose checks, scoring methods, and step anchors are all chosen on purpose before you build it.

When to apply. Use this when you are designing a test for a multi-agent system, or any agent whose value is in how it acts step by step, not just its final answer. Examples: research agents, coding agents, browser agents, and planner-executor systems. Don't apply it for single-turn tasks such as classification or one-shot generation, where one output already shows the whole behaviour.

Example scenario

A research-engineering team is shipping a multi-agent literature-review system. A planner agent breaks down a research question. Three search-specialist agents query arXiv, PubMed, and Semantic Scholar at the same time. A synthesiser agent merges what they find. A verifier agent flags claims that are not supported. A test that scored only the final summary would miss the real problem: most of the team's debugging time was going into bad handoffs between agents and repeated searches.

They ran a planning workshop with two researchers. Out came five checks, plus speed and cost. They tied each check to a point in the run and gave it a scoring method:

(1) Plan completeness: did the planner cover all the sub-questions? Sits on the planner output. Scored by a model judge against the question.
(2) Search efficiency: how many duplicate queries ran across the specialists? Spans all the specialist calls. Scored by a code check that removes duplicate query strings.
(3) Coverage breadth: did the merged result cite enough distinct sources? Sits on the synthesiser, before its output. Scored by a code count of citations.
(4) Verifier precision: the false-positive rate on flagged claims. Sits on the verifier output. Scored by a model judge against the retrieved passages.
(5) Final-summary quality. Sits on the final answer. Scored by a model judge.

They built it as a single harness keyed on the run id and ran it on 12 known-good and 8 known-bad recorded runs from staging. The first version scored a known-bad run too high: the search-efficiency code check used exact-string matching, and the bad run had reworded queries that meant the same thing. They tightened the check, re-calibrated, and only then froze the plan as v1 for measuring new builds.
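The calibration failure in this scenario can be illustrated with a small sketch. The scenario does not say how the team tightened the check, so the token-set normalisation below is only one assumed option, and the stopword list, function names, and sample queries are all hypothetical. It catches reorderings and filler-word changes, not true paraphrases, which would need something stronger such as embedding similarity.

```python
import re

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "to"}


def normalise(query: str) -> frozenset[str]:
    """Reduce a query to its lowercase content words, ignoring word order."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return frozenset(t for t in tokens if t not in STOPWORDS)


def count_duplicates(queries: list[str], exact: bool = False) -> int:
    """Count queries that repeat an earlier query in the same run."""
    seen = set()
    duplicates = 0
    for q in queries:
        key = q if exact else normalise(q)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


# Made-up queries: the first three are rewordings of one another.
queries = [
    "effects of sleep deprivation on memory",
    "sleep deprivation effects on memory",
    "Memory and sleep deprivation effects",
    "caffeine and reaction time",
]
exact_count = count_duplicates(queries, exact=True)   # exact-string matching
loose_count = count_duplicates(queries)               # normalised matching
print(exact_count, loose_count)
```

With exact-string matching the three reworded queries all look distinct, so the run appears efficient; the normalised version counts two of them as repeats.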

Inputs

Multi-agent system definition — The roles, the way the agents talk to each other, and the tools the system can use.
Domain experts — People who can say what a correct run and a correct final result look like in this field.
Sample trajectories — Recorded runs, both good and bad, that show what each check is scoring.

Outputs

Evaluation plan document — The named checks, the scoring method for each one, the step in the run where each is measured, and the pass marks.
Executable evaluation harness — Code that runs the plan against new runs and returns scores.
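One way to keep the plan document machine-readable is to hold it as plain data. The sketch below is an assumption about shape, not a prescribed format: the class names, anchor labels, and pass marks are hypothetical, with criterion names borrowed from the example scenario.

```python
from dataclasses import dataclass
from enum import Enum


class Method(Enum):
    NUMERIC = "numeric"            # e.g. speed, cost, tool-call count
    PROGRAMMATIC = "programmatic"  # e.g. format or pattern check
    LLM_JUDGE = "llm_judge"        # open-ended quality


@dataclass(frozen=True)
class Criterion:
    name: str
    anchor: str        # step in the run where the check is measured
    method: Method
    pass_mark: float   # hypothetical threshold on a 0.0-1.0 scale


@dataclass(frozen=True)
class EvalPlan:
    version: str
    criteria: tuple[Criterion, ...]

    def by_anchor(self, anchor: str) -> list[str]:
        """Names of the checks measured at a given step of the run."""
        return [c.name for c in self.criteria if c.anchor == anchor]


# Illustrative values only; the pass marks are invented for this sketch.
plan = EvalPlan(
    version="v1",
    criteria=(
        Criterion("plan_completeness", "planner_output", Method.LLM_JUDGE, 0.8),
        Criterion("search_efficiency", "specialist_calls", Method.PROGRAMMATIC, 0.9),
        Criterion("final_summary_quality", "final_answer", Method.LLM_JUDGE, 0.7),
    ),
)
print(plan.by_anchor("planner_output"))
```

Frozen dataclasses make accidental edits to a loaded plan raise an error, which suits a document that is meant to change only through a new version.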

Steps (6)

Inventory what is worth measuring
Talk to the experts and list the checks that tell a good run from a bad one. These include teamwork between agents, tool choice, final answer quality, cost, and speed. Do not squash everything down to one 'did the task succeed' score.
Anchor criteria to trajectory points
For each check, decide where it is scored. It might be on a middle step, such as a tool call, a handoff between agents, or a plan revision. Or it might be on the final output. Checks that look at middle steps need the system to record those steps.
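A minimal sketch of anchoring, assuming a recorded run is a flat list of steps tagged with the agent and the kind of step. The `Step` shape and the helper names are hypothetical. The second helper shows why middle-step checks depend on recording: an anchor with no matching step cannot be scored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    agent: str    # which agent produced the step
    kind: str     # e.g. "tool_call", "handoff", "plan_revision", "output"
    payload: str


def steps_at(trajectory: list[Step], agent: str, kind: str) -> list[Step]:
    """Select the recorded steps that a check anchored at (agent, kind) would score."""
    return [s for s in trajectory if s.agent == agent and s.kind == kind]


def missing_anchors(
    trajectory: list[Step], anchors: list[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Anchors with no recorded step: the system is not recording these yet."""
    return [(a, k) for a, k in anchors if not steps_at(trajectory, a, k)]


# Made-up run in which the verifier's output was never recorded.
trajectory = [
    Step("planner", "output", "sub-questions: A, B, C"),
    Step("search_1", "tool_call", "query: A"),
    Step("synthesiser", "output", "merged summary"),
]
anchors = [("planner", "output"), ("search_1", "tool_call"), ("verifier", "output")]
unrecorded = missing_anchors(trajectory, anchors)
print(unrecorded)
```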
Choose a scoring method per criterion
For each check, pick how to score it. Use a plain number for things like speed, cost, or how many tool calls were made. Use a code check for things like valid format or a pattern match. Use a model judge for open-ended quality. Most harnesses mix all three.
Uses: LLM-as-Judge, Sampled Prompt Trace Eval
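The three scoring methods might look like this in miniature. Everything here is an illustrative assumption: the linear latency scale, the required-field check, and the 1 to 5 judge rating are arbitrary choices, and `judge` is a plain callable standing in for whatever model client you use, so no real model API is shown.

```python
import json
import re
from typing import Callable


def score_latency(seconds: float, budget: float) -> float:
    """Numeric: 1.0 at or under budget, falling linearly to 0.0 at twice the budget."""
    return max(0.0, min(1.0, 2.0 - seconds / budget))


def score_json_fields(output: str, required: set[str]) -> float:
    """Programmatic: 1.0 if the output is a JSON object with every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if isinstance(data, dict) and required.issubset(data.keys()):
        return 1.0
    return 0.0


def score_with_judge(judge: Callable[[str], str], rubric: str, text: str) -> float:
    """LLM judge: ask for a 1-5 rating and map it onto 0.0-1.0."""
    reply = judge(f"{rubric}\n\nText:\n{text}\n\nAnswer with a single digit 1-5.")
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return (int(match.group()) - 1) / 4


def stub_judge(prompt: str) -> str:
    """Stand-in for a model call; always answers 4."""
    return "4"


latency_score = score_latency(seconds=3.0, budget=2.0)
format_score = score_json_fields('{"title": "x", "sources": []}', {"title", "sources"})
judge_score = score_with_judge(stub_judge, "Rate the summary's faithfulness.", "text")
print(latency_score, format_score, judge_score)
```

Putting every scorer on the same 0.0 to 1.0 scale is a design choice that keeps pass marks comparable across methods; raw numbers such as seconds or dollars can be reported alongside if that is more useful.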
Assemble the harness
Wire the chosen scorers into one pipeline you can run. It takes a run and returns the full set of scores. Version the plan and the harness together.
Uses: Eval Harness
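A minimal sketch of the wiring, assuming each scorer is a function from a recorded run to a float and a run is a dictionary that carries a `run_id`. The `Harness` class, the scorer names, and the cost budget are hypothetical. The plan version is stamped on every result so scores can be traced to the plan that produced them.

```python
from typing import Any, Callable

Trajectory = dict[str, Any]
Scorer = Callable[[Trajectory], float]


class Harness:
    """Runs every scorer in the plan against one recorded run."""

    def __init__(self, plan_version: str, scorers: dict[str, Scorer]) -> None:
        self.plan_version = plan_version
        self.scorers = dict(scorers)

    def run(self, trajectory: Trajectory) -> dict[str, Any]:
        scores = {name: scorer(trajectory) for name, scorer in self.scorers.items()}
        return {
            "run_id": trajectory["run_id"],
            "plan_version": self.plan_version,
            "scores": scores,
        }


def count_tool_calls(t: Trajectory) -> float:
    """Numeric scorer: how many tool calls the run made."""
    return float(len(t["tool_calls"]))


def within_cost_budget(t: Trajectory) -> float:
    """Programmatic scorer; the 0.50 budget is an invented figure."""
    return 1.0 if t["cost_usd"] <= 0.50 else 0.0


harness = Harness(
    "v1",
    {"tool_call_count": count_tool_calls, "cost_within_budget": within_cost_budget},
)
result = harness.run(
    {"run_id": "run-001", "tool_calls": ["search", "search", "fetch"], "cost_usd": 0.12}
)
print(result)
```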
Calibrate against curated trajectories
Run the harness on the sample runs. Check that good runs score high and bad runs score low. If they do not, the fault lies in the plan, not in the system under test. Fix the plan before you measure anything new.
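Calibration can be reduced to a simple disagreement check, sketched here under assumptions: each curated run carries a `known_good` label, one scorer is checked at a time, and the pass mark of 0.8 and the sample figures are invented. Any run id the function returns is evidence against the plan.

```python
from typing import Any, Callable

Run = dict[str, Any]


def misclassified(
    score_fn: Callable[[Run], float], labelled_runs: list[Run], pass_mark: float
) -> list[str]:
    """Return ids of runs whose pass/fail result disagrees with their label."""
    wrong = []
    for run in labelled_runs:
        passed = score_fn(run) >= pass_mark
        if passed != run["known_good"]:
            wrong.append(run["run_id"])
    return wrong


def search_efficiency(run: Run) -> float:
    """Share of queries that were not flagged as duplicates."""
    return 1.0 - run["duplicate_queries"] / run["total_queries"]


# Made-up curated set: r3 is known-bad, but the check detected few duplicates in it.
curated = [
    {"run_id": "r1", "known_good": True, "duplicate_queries": 0, "total_queries": 8},
    {"run_id": "r2", "known_good": False, "duplicate_queries": 5, "total_queries": 10},
    {"run_id": "r3", "known_good": False, "duplicate_queries": 1, "total_queries": 10},
]
suspects = misclassified(search_efficiency, curated, pass_mark=0.8)
print(suspects)
```

A non-empty result means the scorer or its pass mark needs rework before the harness is used on new runs.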
Treat the plan as a frozen artefact
Once it is calibrated, version the plan. Changing the plan makes your old scores no longer comparable and forces a fresh baseline, just like changing a rubric.
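One way to enforce the freeze is to fingerprint the plan and store that fingerprint with every score, so results from different plans are never compared by accident. This is a sketch under assumptions: the plan is a JSON-serialisable dictionary, and the function names and the 12-character truncation are arbitrary choices.

```python
import hashlib
import json
from typing import Any


def plan_fingerprint(plan: dict[str, Any]) -> str:
    """Hash a canonical JSON form of the plan, so key order does not matter."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def comparable(result_a: dict[str, Any], result_b: dict[str, Any]) -> bool:
    """Scores are comparable only if they came from the same frozen plan."""
    return result_a["plan_fingerprint"] == result_b["plan_fingerprint"]


# Illustrative plans; the pass marks are invented for this sketch.
plan_v1 = {"version": "v1", "criteria": {"search_efficiency": {"pass_mark": 0.8}}}
plan_v1_reordered = {"criteria": {"search_efficiency": {"pass_mark": 0.8}}, "version": "v1"}
plan_edited = {"version": "v1", "criteria": {"search_efficiency": {"pass_mark": 0.7}}}

old = {"run_id": "build-41", "plan_fingerprint": plan_fingerprint(plan_v1)}
new = {"run_id": "build-42", "plan_fingerprint": plan_fingerprint(plan_edited)}
print(comparable(old, new))
```

Here the edited pass mark changes the fingerprint even though the version label was left at v1, which is exactly the silent change a frozen plan is meant to catch.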

Principles

Plan the test before you build it. Building it shows you what is easy to measure, not what matters to measure.
For a multi-agent system, the unit you study is the whole run, not a single output.
Mix scoring methods on purpose. Numbers, code checks, and judges each have their own weak spots.
Check the harness on known-good and known-bad runs before you trust it.
