IX · Routing & Composition · Experimental

Automatic Workflow Search

also known as AFlow, Workflow Synthesis, MCTS over Agent Graphs

Treat the agent's workflow (a graph of LLM-invoking nodes) as an artefact to search; use Monte Carlo Tree Search guided by an eval benchmark to discover the best workflow, then deploy it.

Context

A team is building an agent for a repeatable task domain such as competitive coding, mathematical problem solving, or question answering, where each output can be scored automatically against a benchmark of known answers. They are choosing how to compose the agent out of named building blocks like a router, a planner, an ensembler, a reviewer, and a revise step, but no one on the team knows in advance which arrangement of these blocks will perform best on the target task.

Problem

When the workflow shape is chosen by a human designer, the choice is biased toward whatever patterns the designer has seen before, and exploring even a handful of alternatives by hand is slow and expensive. Each candidate workflow has to be implemented, run end-to-end against the benchmark, and compared, so the search space the team actually covers is a tiny fraction of the realistic compositions. The result is workflows that work but are almost certainly not the best the model and tools could deliver.

Forces

  • There is a combinatorial space of workflows.
  • Each workflow run costs money to evaluate.
  • Search needs a signal (benchmark scores) plus an explore/exploit policy.
  • Workflows have to be representable as code or as a graph for search to work.
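The first and last forces can be made concrete with a minimal sketch. It assumes, purely for illustration, that a workflow is a linear sequence of operators drawn from the building blocks named in this pattern; the `Workflow` class, the `neighbours` edit set, and `count_workflows` are hypothetical names, and a real search would likely allow branching graphs rather than tuples.

```python
from dataclasses import dataclass

# Hypothetical operator names, taken from the building blocks this pattern lists.
OPERATORS = ("router", "planner", "ensemble", "review", "revise", "executor")


@dataclass(frozen=True)
class Workflow:
    """A linear workflow: an ordered tuple of operator names.

    Frozen so that candidates are hashable and can be deduplicated.
    """
    steps: tuple[str, ...]

    def neighbours(self) -> list["Workflow"]:
        """All workflows one edit away: append, drop, or swap one operator."""
        out = set()
        for op in OPERATORS:
            out.add(Workflow(self.steps + (op,)))  # append an operator
        for i in range(len(self.steps)):
            out.add(Workflow(self.steps[:i] + self.steps[i + 1:]))  # drop step i
            for op in OPERATORS:
                if op != self.steps[i]:
                    # swap step i for a different operator
                    out.add(Workflow(self.steps[:i] + (op,) + self.steps[i + 1:]))
        return sorted(out, key=lambda w: w.steps)


def count_workflows(max_len: int) -> int:
    """Number of distinct linear workflows with 1..max_len steps."""
    return sum(len(OPERATORS) ** n for n in range(1, max_len + 1))
```

Even under this restricted representation, `count_workflows(5)` gives 9,330 candidates from six operators, which is already far more than a team can implement and benchmark by hand.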

Example

A research lab has built six different agent workflows for a maths-olympiad benchmark — chain-of-thought, debate, planner-executor, and so on — and none consistently wins. Hand-tuning the next variant is slow and biased toward what the team already knows. They treat each workflow as a graph of LLM-invoking nodes and let an MCTS search explore variations, scoring each candidate against the benchmark. After a few thousand evaluations the search returns a workflow shape no one on the team had drafted, and it ships.

Solution

Therefore:

Represent each candidate workflow as code or as a graph of nodes (router, planner, ensemble, review, revise, executor). Search the space with MCTS: select a candidate by UCB-style scoring on past benchmark performance, expand it through code mutations or graph edits, simulate by running the new workflow on the eval set, and backpropagate the score. Constrain the search space with a library of operators (Ensemble, Review, Revise). When the search budget is exhausted, deploy the best-scoring workflow.
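The four MCTS phases can be sketched in a few dozen lines. This is a toy under stated assumptions, not a reference implementation: workflows are linear tuples of operator names, the only edit is "append an operator", and `toy_score` is a hypothetical stand-in for running the workflow against the benchmark. The names `Node`, `search`, `max_len`, and the exploration constant `c` are all illustrative.

```python
import math
import random

# Hypothetical operator library; names follow the nodes listed above.
OPERATORS = ("planner", "ensemble", "review", "revise", "executor")


def toy_score(steps):
    """Stand-in for running a workflow on the eval set (an assumption).

    Rewards a planner first, an executor last, and a review directly
    followed by a revise; subtracts a small per-step cost.
    """
    score = 0.0
    if steps and steps[0] == "planner":
        score += 0.3
    if steps and steps[-1] == "executor":
        score += 0.3
    if any(a == "review" and b == "revise" for a, b in zip(steps, steps[1:])):
        score += 0.4
    return max(0.0, score - 0.02 * len(steps))


class Node:
    """One candidate workflow in the search tree."""

    def __init__(self, steps, parent=None, max_len=4):
        self.steps = steps
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total = 0.0
        # Edits not yet tried from here; none once the length cap is reached.
        self.untried = list(OPERATORS) if len(steps) < max_len else []

    def ucb(self, c):
        """UCB1: mean score plus a bonus for rarely visited nodes."""
        mean = self.total / self.visits
        return mean + c * math.sqrt(math.log(self.parent.visits) / self.visits)


def search(budget, max_len=4, c=0.7, seed=0):
    rng = random.Random(seed)
    root = Node((), max_len=max_len)
    best_steps, best_score = (), toy_score(())
    for _ in range(budget):
        node = root
        # Selection: descend through fully expanded nodes by UCB.
        while not node.untried and node.children:
            node = max(node.children, key=lambda n: n.ucb(c))
        # Expansion: apply one untried edit (append an operator).
        if node.untried:
            op = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(node.steps + (op,), parent=node, max_len=max_len)
            node.children.append(child)
            node = child
        # Simulation: score the candidate on the (toy) benchmark.
        score = toy_score(node.steps)
        if score > best_score:
            best_steps, best_score = node.steps, score
        # Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.total += score
            node = node.parent
    return best_steps, best_score
```

Calling `search(budget=2000)` returns the best step tuple seen and its toy score. In a real system the call to `toy_score` is where the money goes, since it stands for a full benchmark run, so `budget` is effectively the spending cap on the search.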

What this pattern forbids. No workflow may be deployed unless it has been measured against the held-out eval set; any ad-hoc human edit to a discovered workflow produces a new candidate that must re-enter the search and be scored before it ships.
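One way to enforce this rule mechanically is to key recorded scores by a fingerprint of the workflow's source, so that any edit, however small, yields an unmeasured candidate. The sketch below is an assumption about how such a gate might look; `DeployGate`, `record`, and `can_deploy` are hypothetical names, and only `hashlib.sha256` is a real library call.

```python
import hashlib


class DeployGate:
    """Allows deployment only for workflows with a recorded held-out score."""

    def __init__(self):
        self._scores = {}  # fingerprint -> held-out eval score

    @staticmethod
    def fingerprint(workflow_source: str) -> str:
        """Content hash of the workflow; changes whenever the source changes."""
        return hashlib.sha256(workflow_source.encode("utf-8")).hexdigest()

    def record(self, workflow_source: str, held_out_score: float) -> None:
        """Store the score measured for exactly this version of the workflow."""
        self._scores[self.fingerprint(workflow_source)] = held_out_score

    def can_deploy(self, workflow_source: str) -> bool:
        """True only if this exact source has been measured."""
        return self.fingerprint(workflow_source) in self._scores
```

A hand-edited copy of a measured workflow hashes to a different fingerprint, so `can_deploy` returns `False` for it until it has been scored again.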

The smaller patterns that complete this one —

  • uses · Eval Harness ★★ · Run a held-out dataset against agent versions to detect regressions and measure improvement.

And the patterns that stand alongside it, or against it —

  • complements · Eval as Contract ★★ · Treat the eval suite as the contract the agent must satisfy; releases ship only if evals pass.
  • complements · Language Agent Tree Search · Lift the agent loop into a search tree with a learned value function and backtracking.
  • alternative-to · Spec-First Agent · Drive the agent loop from a human-authored specification document rather than free-form prompts.
  • complements · Best-of-N Sampling · Sample N candidate outputs and select the highest-ranked by a reward model or scorer.
