Evaluation Planning Framework
Produce a runnable test harness for a multi-agent system, with its checks, scoring methods, and step anchors all chosen deliberately before you build it.
Description
Plan how to test a multi-agent system from its whole run, not just its final answer. The whole run, step by step, is called the trajectory. Turn expert know-how into a list of things to check. For each one, decide how to score it: a plain numeric metric, a code check, or a model judge. Then assemble those choices into a harness you can run. The result is an evaluation plan. It names what you measure at each step of the run and how you measure it, before you write any code.
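One way to picture such a plan is as plain data. The sketch below is illustrative only: the names Scoring, Criterion, PLAN, and criteria_by_anchor, and the three example criteria, are assumptions made for this example and do not come from the pattern itself.

```python
from dataclasses import dataclass
from enum import Enum


class Scoring(Enum):
    METRIC = "metric"            # a plain number computed from the run
    CODE_CHECK = "code_check"    # a pass/fail check written in code
    MODEL_JUDGE = "model_judge"  # a rubric scored by a model


@dataclass(frozen=True)
class Criterion:
    name: str
    anchor: str       # the trajectory point this criterion applies to
    scoring: Scoring  # how the criterion is scored
    rationale: str    # the expert know-how behind the check


# Hypothetical criteria for a research agent, written before any harness code.
PLAN = (
    Criterion("cites_sources", "final_answer", Scoring.CODE_CHECK,
              "Answers without citations cannot be verified."),
    Criterion("tool_call_count", "whole_run", Scoring.METRIC,
              "Excessive tool calls suggest the planner is lost."),
    Criterion("plan_is_coherent", "planning_step", Scoring.MODEL_JUDGE,
              "Coherence needs judgement, not a pattern match."),
)


def criteria_by_anchor(plan):
    """Group criterion names by the trajectory point they are anchored to."""
    grouped = {}
    for criterion in plan:
        grouped.setdefault(criterion.anchor, []).append(criterion.name)
    return grouped
```

Calling criteria_by_anchor(PLAN) shows which checks fire at which point of the run, so you can spot steps that nothing measures.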
When to apply
Use this when you are designing a test for a multi-agent system, or for any agent whose value lies in how it acts step by step, not just in its final answer. Examples: research agents, coding agents, browser agents, and planner-executor systems. Don't apply it to single-turn tasks such as classification or one-shot generation, where one output already shows the whole behaviour.
What it involves
- Inventory what is worth measuring
- Anchor criteria to trajectory points
- Choose a scoring method per criterion
- Assemble the harness
- Calibrate against curated trajectories
- Treat the plan as a frozen artefact
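The steps above can be sketched as a small harness. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes a trajectory is a list of step dicts with "role" and "content" keys, and every name in it (has_citation, plan_length, HARNESS, run_harness, calibrate, plan_fingerprint, GOOD_RUN, BAD_RUN) is hypothetical. A model judge would plug in as another scorer function and is left out here.

```python
import hashlib
import json


def has_citation(step):
    """Code check: 1.0 if the content holds a bracketed marker such as [1]."""
    return 1.0 if "[" in step["content"] and "]" in step["content"] else 0.0


def plan_length(step):
    """Plain number: how many lines the plan contains."""
    return float(len(step["content"].splitlines()))


# Criterion name -> (anchor role, scorer). Assumed layout for this sketch.
HARNESS = {
    "cites_sources": ("final_answer", has_citation),
    "plan_length": ("planner", plan_length),
}


def run_harness(trajectory, harness=HARNESS):
    """Score each criterion at every step whose role matches its anchor."""
    return {
        name: [scorer(step) for step in trajectory if step["role"] == anchor]
        for name, (anchor, scorer) in harness.items()
    }


def calibrate(curated, harness=HARNESS):
    """List (trajectory index, criterion) pairs that disagree with expert labels."""
    mismatches = []
    for index, (trajectory, expected) in enumerate(curated):
        scores = run_harness(trajectory, harness)
        for name, want in expected.items():
            if scores[name] != want:
                mismatches.append((index, name))
    return mismatches


def plan_fingerprint(harness=HARNESS):
    """Hash criterion names and anchors so edits to a frozen plan are visible."""
    spec = sorted((name, anchor) for name, (anchor, _) in harness.items())
    return hashlib.sha256(json.dumps(spec).encode()).hexdigest()


# Two hand-labelled trajectories used for calibration (made-up data).
GOOD_RUN = [
    {"role": "planner", "content": "search\nread\nanswer"},
    {"role": "final_answer", "content": "Paris [1]"},
]
BAD_RUN = [
    {"role": "planner", "content": "answer"},
    {"role": "final_answer", "content": "Paris"},
]
CURATED = [
    (GOOD_RUN, {"cites_sources": [1.0], "plan_length": [3.0]}),
    (BAD_RUN, {"cites_sources": [0.0], "plan_length": [1.0]}),
]
```

An empty list from calibrate(CURATED) means the harness agrees with the expert labels on the curated runs. Recording plan_fingerprint() alongside results is one way to notice when a plan that was meant to stay frozen has changed.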