Multi-Agent

RL-Trained Conductor Orchestrator

Train a small meta-model with reinforcement learning to dispatch sub-tasks across a pool of frontier LLM workers, learning the communication topology end-to-end and allowing the conductor to recursively invoke itself as a worker.

Problem

Hand-coded orchestrator logic does not generalise across the breadth of incoming tasks: static heuristics for which model gets which sub-task miss the task-specific signals that actually predict the right routing, and the rules grow stale every time the worker pool changes. Using a frontier model itself as the orchestrator is expensive on every dispatch step and still does not learn from the reward signal that finished tasks provide. There is no obvious place for the system to improve its own decomposition strategy from experience, so every gain in routing quality requires another round of human rule editing.

Solution

A small conductor model (often in the 7B–13B parameter range) sits in front of a pool of worker LLMs and tools. On each step the conductor emits a natural-language sub-task instruction and a worker selection; the selected worker runs, its output is returned to the conductor, and the conductor decides the next move. The conductor is trained with reinforcement learning against final task rewards: it learns which workers handle which sub-task shapes, how to phrase the hand-off, when to stop, and when to recursively dispatch a sub-task back to itself as a worker. Recursion is bounded by a depth limit and a step budget. Workers remain frozen frontier models; only the conductor is trained.
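The dispatch loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names `Action`, `Budget`, `run_conductor`, and the reserved worker name `"conductor"` are hypothetical, the policy is any callable standing in for the trained conductor model, and the RL training step itself is not shown. The sketch also assumes the step budget is shared across recursive calls, which the text above does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for illustration only.
Worker = Callable[[str], str]                 # instruction -> output
Step = Tuple[str, str, str]                   # (worker name, instruction, output)

@dataclass
class Action:
    worker: Optional[str]   # None means "stop"
    instruction: str        # sub-task text, or the final answer when stopping

@dataclass
class Budget:
    steps: int              # assumed to be shared across all recursion levels

Policy = Callable[[str, List[Step]], Action]  # stand-in for the trained conductor

SELF = "conductor"          # assumed reserved name for recursive self-dispatch

def run_conductor(task: str, policy: Policy, workers: Dict[str, Worker],
                  budget: Budget, depth: int = 0, max_depth: int = 2) -> str:
    """Run one conductor episode and return its final answer."""
    history: List[Step] = []
    while budget.steps > 0:
        action = policy(task, history)
        if action.worker is None:             # conductor chose to stop
            return action.instruction
        budget.steps -= 1                     # every dispatch costs one step
        if action.worker == SELF:
            if depth >= max_depth:            # depth limit bounds the recursion
                output = "[refused: depth limit reached]"
            else:
                output = run_conductor(action.instruction, policy, workers,
                                       budget, depth + 1, max_depth)
        else:
            worker = workers.get(action.worker)
            output = (worker(action.instruction) if worker is not None
                      else "[error: unknown worker]")
        history.append((action.worker, action.instruction, output))
    # Budget exhausted: fall back to the most recent worker output, if any.
    return history[-1][2] if history else ""
```

In training, the final answer returned by `run_conductor` would be scored by the task reward and used to update only the policy; the entries in `workers` stay frozen. Refusals and errors are fed back as ordinary outputs here so that the policy can observe them, which is one possible design choice among several.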

When to use

A heterogeneous frontier-model worker pool is in production and routing matters.
Task-outcome rewards are observable at scale.
An RL training pipeline (or partner) is available.
