Methodology · Evaluation

Evaluation Planning Framework

Produce a runnable test harness for a multi-agent system whose checks, scoring methods, and step anchors are all chosen on purpose before you build it.

Description

Plan how to test a multi-agent system from its whole run, not just its final answer. The whole run, step by step, is called the trajectory. Turn expert know-how into a list of things to check. For each one, decide how to score it: a plain number, a code check, or a model judge. Then assemble those choices into a harness you can run. The result is an evaluation plan. It names what you measure at each step of the run and how you measure it, before you write any code.
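One way to picture such a plan is as plain data: each criterion names an anchor (which steps of the trajectory it applies to), a scoring method, and a scoring function. The sketch below is illustrative only. The names `Criterion`, `Method`, `run_plan`, the step fields (`kind`, `args`, `items`, `output`) and the stubbed judge are all assumptions made for this example, not part of the framework itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Method(Enum):
    """The three scoring methods named in the description."""
    METRIC = "plain number"
    CODE_CHECK = "code check"
    MODEL_JUDGE = "model judge"


@dataclass(frozen=True)
class Criterion:
    name: str
    anchor: Callable[[dict], bool]   # selects the steps this criterion applies to
    method: Method
    score: Callable[[dict], float]   # scores one selected step


def run_plan(plan: list, trajectory: list) -> dict:
    """Score every criterion on the steps it is anchored to; average per criterion."""
    results: dict = {}
    for criterion in plan:
        scores = [criterion.score(s) for s in trajectory if criterion.anchor(s)]
        # None means the anchor matched no step in this trajectory.
        results[criterion.name] = sum(scores) / len(scores) if scores else None
    return results


def stub_judge(step: dict) -> float:
    """Stand-in for a model judge (assumption): a real one would call a model."""
    return 1.0 if "source" in step.get("output", "") else 0.0


# Hypothetical plan: one criterion per scoring method.
PLAN = [
    Criterion("plan_size", lambda s: s["kind"] == "plan",
              Method.METRIC, lambda s: float(len(s["items"]))),
    Criterion("tool_call_is_valid", lambda s: s["kind"] == "tool_call",
              Method.CODE_CHECK, lambda s: float(isinstance(s.get("args"), dict))),
    Criterion("answer_cites_sources", lambda s: s["kind"] == "final",
              Method.MODEL_JUDGE, stub_judge),
]

# Hypothetical trajectory: the whole run, step by step.
trajectory = [
    {"kind": "plan", "items": ["search", "read", "answer"]},
    {"kind": "tool_call", "args": {"query": "x"}},
    {"kind": "tool_call", "args": None},
    {"kind": "final", "output": "Answer with source: ..."},
]

results = run_plan(PLAN, trajectory)
```

Keeping the anchor separate from the scoring function makes the "what to measure at each step" decision visible in the plan itself, rather than buried inside the scoring code.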

When to apply

Use this when you are designing a test for a multi-agent system, or for any agent whose value lies in how it acts step by step, not just in its final answer. Examples: research agents, coding agents, browser agents, and planner-executor systems. Do not apply it to single-turn tasks such as classification or one-shot generation, where one output already shows the whole behaviour.

What it involves

Inventory what is worth measuring
Anchor criteria to trajectory points
Choose a scoring method per criterion
Assemble the harness
Calibrate against curated trajectories
Treat the plan as a frozen artefact
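The last two steps can be sketched in a few lines, under stated assumptions. Here calibration is read as checking how often the harness verdict matches an expert verdict on a small curated set, and "frozen" is read as fingerprinting the plan so any later change is detectable. Both readings, and the names `agreement`, `freeze` and the plan layout, are assumptions for illustration rather than the framework's own definitions.

```python
import hashlib
import json


def agreement(harness_verdicts: list, expert_verdicts: list) -> float:
    """Fraction of curated trajectories where the harness agrees with the expert."""
    if not expert_verdicts or len(harness_verdicts) != len(expert_verdicts):
        raise ValueError("need two non-empty verdict lists of equal length")
    matches = sum(h == e for h, e in zip(harness_verdicts, expert_verdicts))
    return matches / len(expert_verdicts)


def freeze(plan: dict) -> str:
    """Return a SHA-256 fingerprint of the plan in a canonical JSON form."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical plan: criterion names, anchors and methods are made up.
plan = {
    "version": 1,
    "criteria": [
        {"name": "tool_call_is_valid", "anchor": "tool_call", "method": "code_check"},
        {"name": "answer_cites_sources", "anchor": "final", "method": "model_judge"},
    ],
}

# Hypothetical verdicts on four curated trajectories (pass = True).
expert = [True, True, False, True]
harness = [True, False, False, True]

rate = agreement(harness, expert)      # 3 of 4 verdicts match
fingerprint = freeze(plan)             # record this next to every result
changed = dict(plan, version=2)        # any edit yields a new fingerprint
```

Storing the fingerprint alongside each set of results makes it possible to tell whether two runs were scored under the same plan.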
