Evaluation Planning Framework
Also known as: trajectory-grounded MAS eval planning, MAS evaluation plan
Plan how to test a multi-agent system (MAS) from its whole run, not just its final answer. The whole run, step by step, is called the trajectory. Turn expert know-how into a list of things to check. For each one, decide how to score it: a plain number, a code check, or a model judge. Then assemble those choices into a harness you can run. The evaluation plan comes first: it names what you measure at each step of the run and how you measure it, before you write any harness code.
Methodology process overview
Intent. Produce a runnable test harness for a multi-agent system, with the checks, scoring methods, and step anchors all chosen on purpose before the harness is built.
When to apply. Use this when you are designing a test for a multi-agent system, or for any agent whose value is in how it acts step by step, not just in its final answer. Examples: research agents, coding agents, browser agents, and planner-executor systems. Do not apply it to single-turn tasks such as classification or one-shot generation, where one output already shows the whole behaviour.
Inputs
- Multi-agent system definition — The roles, the way the agents talk to each other, and the tools the system can use.
- Domain experts — People who can say what a correct run and a correct final result look like in this field.
- Sample trajectories — Recorded runs, both good and bad, that show what each check is scoring.
Outputs
- Evaluation plan document — The named checks, the scoring method for each one, the step in the run where each is measured, and the pass marks.
- Executable evaluation harness — Code that runs the plan against new runs and returns scores.
Steps (6)
Inventory what is worth measuring
Talk to the experts and list the checks that tell a good run from a bad one. These include teamwork between agents, tool choice, final answer quality, cost, and speed. Do not squash everything down to one 'did the task succeed' score.
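The inventory can be kept as plain data so that it is reviewable before any scorer exists. The sketch below is only an illustration: the `Criterion` type, its field names, and the five entries are hypothetical, chosen to mirror the dimensions named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str       # short identifier for the check
    question: str   # what the expert wants to know about a run
    raised_by: str  # who proposed it (hypothetical role labels)


# Hypothetical inventory; a real one comes out of the expert interviews.
INVENTORY = [
    Criterion("handoff_quality", "Did each agent pass on what the next one needed?", "domain expert"),
    Criterion("tool_choice", "Was the right tool used for each sub-task?", "domain expert"),
    Criterion("answer_quality", "Is the final answer correct and complete?", "domain expert"),
    Criterion("cost", "How much did the run cost?", "engineering"),
    Criterion("latency", "How long did the run take?", "engineering"),
]


def is_collapsed(inventory):
    """True if the inventory has been squashed down to a single check."""
    return len(inventory) <= 1


names = [c.name for c in INVENTORY]
print(names, "collapsed:", is_collapsed(INVENTORY))
```

Keeping the expert's question next to each name makes it easier to argue later about whether a scorer really answers it.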
Anchor criteria to trajectory points
For each check, decide where it is scored. It might be on a middle step, such as a tool call, a handoff between agents, or a plan revision. Or it might be on the final output. Checks that look at middle steps need the system to record those steps.
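One way to make the anchoring explicit is a mapping from each check to the kind of step it reads. The sketch below assumes a hypothetical `Step` record and hypothetical step kinds (`tool_call`, `handoff`, `final_output`); it also reports checks whose anchor never appears in a recorded run, which usually means the step was not recorded or did not occur.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Step:
    index: int
    kind: str                 # e.g. "tool_call", "handoff", "final_output"
    agent: str
    payload: dict[str, Any]


# Hypothetical anchors: check name -> kind of step it is scored on.
ANCHORS = {
    "tool_choice": "tool_call",
    "handoff_quality": "handoff",
    "answer_quality": "final_output",
}


def steps_for(criterion, trajectory):
    """Return the steps of a trajectory that a given check is scored on."""
    kind = ANCHORS[criterion]
    return [s for s in trajectory if s.kind == kind]


def unanchored_checks(trajectory):
    """Checks whose anchor kind is absent from the recorded trajectory."""
    recorded = {s.kind for s in trajectory}
    return sorted(c for c, kind in ANCHORS.items() if kind not in recorded)


trajectory = [
    Step(0, "tool_call", "researcher", {"tool": "search"}),
    Step(1, "final_output", "writer", {"text": "draft answer"}),
]
print(len(steps_for("tool_choice", trajectory)), unanchored_checks(trajectory))
```

Here the run has no recorded handoff, so `handoff_quality` cannot be scored on it until the system logs that step.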
Choose a scoring method per criterion
For each check, pick how to score it. Use a plain number for things like speed, cost, or how many tool calls were made. Use a code check for things like valid format or a pattern match. Use a model judge for open-ended quality. Most harnesses mix all three.
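The three scoring methods can share one shape: a function that takes a trajectory and returns a score. The sketch below is a minimal illustration under stated assumptions: the trajectory is a list of dicts with a `kind` key, the final step carries a `text` key, and the model judge is passed in as a plain callable (`stub_judge` is a stand-in, not a real model call).

```python
import json
from typing import Callable


def tool_call_count(trajectory):
    """Plain number: how many tool calls the run made."""
    return sum(1 for step in trajectory if step["kind"] == "tool_call")


def final_output_is_json(trajectory):
    """Code check: does the final output parse as JSON?"""
    try:
        json.loads(trajectory[-1]["text"])
        return True
    except json.JSONDecodeError:
        return False


def make_judge_scorer(judge: Callable[[str], float], rubric: str):
    """Model judge: wrap any callable that maps a prompt to a score."""
    def score(trajectory):
        prompt = f"{rubric}\n\nAnswer:\n{trajectory[-1]['text']}"
        return judge(prompt)
    return score


def stub_judge(prompt: str) -> float:
    # Stand-in for a real model call; always returns the same score.
    return 0.5


run = [
    {"kind": "tool_call", "tool": "search"},
    {"kind": "tool_call", "tool": "fetch"},
    {"kind": "final_output", "text": '{"answer": 42}'},
]
judge_quality = make_judge_scorer(stub_judge, "Rate correctness from 0 to 1.")
print(tool_call_count(run), final_output_is_json(run), judge_quality(run))
```

Passing the judge in as a callable keeps the scorer testable without a model, and makes it easy to swap judges later.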
Assemble the harness
Wire the chosen scorers into one pipeline you can run. It takes a run and returns the full set of scores. Version the plan and the harness together.
Uses: Eval Harness
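A minimal sketch of such a pipeline is shown below, assuming each scorer is a function from a trajectory to a score. The `Harness` class, the `PLAN_VERSION` string, and the two example scorers are hypothetical names used only for illustration.

```python
from typing import Any, Callable

Trajectory = list[dict[str, Any]]
Scorer = Callable[[Trajectory], Any]

PLAN_VERSION = "0.1.0"  # hypothetical version label shared by plan and harness


class Harness:
    """Runs every scorer in the plan against one trajectory."""

    def __init__(self, plan_version: str, scorers: dict[str, Scorer]):
        self.plan_version = plan_version
        self.scorers = dict(scorers)

    def run(self, trajectory: Trajectory) -> dict[str, Any]:
        scores = {name: scorer(trajectory) for name, scorer in self.scorers.items()}
        # Stamp the result with the plan version so scores stay traceable.
        return {"plan_version": self.plan_version, "scores": scores}


def count_tool_calls(trajectory):
    return sum(1 for step in trajectory if step.get("kind") == "tool_call")


def has_final_output(trajectory):
    return bool(trajectory) and trajectory[-1].get("kind") == "final_output"


harness = Harness(PLAN_VERSION, {
    "tool_calls": count_tool_calls,
    "has_final_output": has_final_output,
})
run = [{"kind": "tool_call"}, {"kind": "final_output", "text": "done"}]
report = harness.run(run)
print(report)
```

Stamping each report with the plan version is one simple way to keep the plan and the harness versioned together.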
Calibrate against curated trajectories
Run the harness on the sample runs. Check that good runs score high and bad runs score low. If they do not, the plan is wrong, not the system, because these runs are already known to be good or bad. Fix the plan before you measure anything new.
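The calibration check can be sketched as a comparison of scores on the known-good and known-bad runs. The example below makes several assumptions: runs are reduced to lists of tool names, the scorer `non_redundant_ratio` is hypothetical, and "separated" means every good run outscores every bad run, which is a strict choice that a real plan might relax.

```python
def calibrate(score_fn, good_runs, bad_runs):
    """Score curated runs and report whether good and bad are separated."""
    if not good_runs or not bad_runs:
        raise ValueError("calibration needs at least one good and one bad run")
    good = [score_fn(run) for run in good_runs]
    bad = [score_fn(run) for run in bad_runs]
    # Strict separation: the worst good run must beat the best bad run.
    return {"good": good, "bad": bad, "separated": min(good) > max(bad)}


def non_redundant_ratio(tool_calls):
    """Hypothetical scorer: share of tool calls that are not repeats."""
    return len(set(tool_calls)) / len(tool_calls) if tool_calls else 0.0


GOOD_RUNS = [["search", "fetch"], ["search", "fetch", "summarise"]]
BAD_RUNS = [["search", "search", "search"], ["fetch", "fetch"]]

result = calibrate(non_redundant_ratio, GOOD_RUNS, BAD_RUNS)
broken = calibrate(len, GOOD_RUNS, BAD_RUNS)  # run length does not separate them
print(result["separated"], broken["separated"])
```

The second call shows a scorer that fails calibration: run length alone cannot tell these good runs from the bad ones, so that check would need to be redesigned.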
Treat the plan as a frozen artefact
Once it is calibrated, freeze the plan and give it a version. Changing the plan makes your old scores no longer comparable and forces a fresh baseline, just like changing a rubric.
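One way to enforce this is to fingerprint the plan and store the fingerprint with every result. The sketch below assumes the plan is a JSON-serialisable dict; the plan contents, the field names, and the 12-character truncation are illustrative choices only.

```python
import hashlib
import json


def plan_fingerprint(plan: dict) -> str:
    """Stable short hash of a plan; any change to the plan changes it."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def comparable(result_a: dict, result_b: dict) -> bool:
    """Two results are comparable only if they used the same plan."""
    return result_a["plan_fingerprint"] == result_b["plan_fingerprint"]


# Hypothetical plan with one check.
PLAN = {
    "criteria": {
        "answer_quality": {"anchor": "final_output", "method": "model_judge", "pass_mark": 0.7},
    }
}
EDITED = json.loads(json.dumps(PLAN))  # deep copy via a JSON round trip
EDITED["criteria"]["answer_quality"]["pass_mark"] = 0.8

old_result = {"plan_fingerprint": plan_fingerprint(PLAN), "score": 0.75}
new_result = {"plan_fingerprint": plan_fingerprint(EDITED), "score": 0.75}
print(comparable(old_result, new_result))
```

Even a changed pass mark produces a different fingerprint, so the two results above are flagged as not comparable and a fresh baseline is needed.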
Principles
- Plan the test before you build it. Building it shows you what is easy to measure, not what matters to measure.
- For a multi-agent system, the unit you study is the whole run, not a single output.
- Mix scoring methods on purpose. Numbers, code checks, and judges each have their own weak spots.
- Check the harness on known-good and known-bad runs before you trust it.
Known failure modes (2)
- ✕ Errors Swept Under the Rug
Scoring only the final output and missing trajectory-level failures (wrong sub-agent used, redundant tool calls) that the system would repeat at scale.
- ★ Compound Error Degradation
An eval that aggregates per-step scores into one number hides which step in a long trajectory caused the regression.
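The second failure mode can be illustrated with a small sketch that keeps per-step scores alongside the aggregate. The score lists and the pass mark below are made-up values, and the helper names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def first_failing_step(step_scores, pass_mark):
    """Index of the first step scoring below the pass mark, or None."""
    for index, score in enumerate(step_scores):
        if score < pass_mark:
            return index
    return None


def regressed_steps(baseline, candidate):
    """Indices of steps where the candidate scores lower than the baseline."""
    return [i for i, (b, c) in enumerate(zip(baseline, candidate)) if c < b]


# Made-up per-step scores for two versions of the same system.
baseline = [0.9, 0.9, 0.9, 0.9]
candidate = [0.9, 0.9, 0.4, 0.9]

# The aggregate shows that something got worse, but not where.
print(round(mean(baseline), 3), round(mean(candidate), 3))
# The per-step view points at the step that regressed.
print(first_failing_step(candidate, pass_mark=0.7), regressed_steps(baseline, candidate))
```

Reporting the per-step scores next to the aggregate keeps the location of the regression visible instead of averaging it away.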
Related patterns (4)
- ★★ Eval Harness
Run a held-out dataset against agent versions to detect regressions and measure improvement.
- ★★ LLM-as-Judge
Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
- ★ Sampled Prompt Trace Eval
Capture full prompt/response/metadata traces from production into a monitoring dataset, but only run LLM-judge evaluation on a random sample so monitoring cost stays bounded as traffic grows.
- ★ Reasoning Trace Carry-Forward
For reasoning models that emit a separate reasoning trace, preserve that trace in context across the same logical task episode (across tool-call/result turns) but drop it at user-turn boundaries.
Related compositions (2)
- recipe · abstract shape: Eval & Observability
How you keep an agent honest in production: harness, judge, decision log, provenance, shadow rollouts.
- recipe · abstract shape: Multi-Agent Coordination
Several agents collaborate under a coordinator, with explicit hand-offs and a shared protocol. The shape behind LangGraph supervisor, OpenAI Swarm, AutoGen group chat, Bedrock multi-agent orchestrators.
Related methodologies (2)
Sources (2)
Designing Multi-Agent Systems
Ch 10 'Evaluating Multi-Agent Systems' (Evaluation-Driven Development; Multiagent Trajectories; Practical Evaluation Harness — sub-section labels not independently verified online) “What We're Evaluating: Multiagent Trajectories”
designing-multiagent-systems (Victor Dibia)
Ch 10 'Evaluating Multi-Agent Systems' “Chapter 10: 'Evaluating Multi-Agent Systems' with evaluation frameworks and LLM-as-judge patterns”
Provenance
- Added to catalog:
- Last updated:
- Verification status: needs-verification