RL-Trained Conductor Orchestrator

also known as 指揮者モデル (Conductor Model), Trained Conductor, Fugu Conductor, Self-Calling Orchestrator

Train a small meta-model with reinforcement learning to dispatch sub-tasks across a pool of frontier LLM workers, learning the communication topology end-to-end and allowing the conductor to recursively invoke itself as a worker.

This pattern helps complete certain larger patterns —

specialises Orchestrator-Workers ★★: An orchestrator dynamically breaks a task into subtasks at runtime and delegates each to a worker LLM, then synthesises results.

Context

A team operates a production multi-agent stack that dispatches sub-tasks across a heterogeneous pool of frontier large language models from different vendors — one strong at long-context summarisation, one at code synthesis, one at image understanding — plus a set of tools. The routing logic between them is usually a hand-written tree of if-this-then-that rules with prompt-time hints. Tasks span many domains and the pool of workers keeps changing as vendors release and deprecate models.

Problem

Hand-coded orchestrator logic does not generalise across the breadth of incoming tasks: static heuristics for which model gets which sub-task miss the task-specific signals that actually predict the right routing, and the rules grow stale every time the worker pool changes. Using a frontier model itself as the orchestrator is expensive on every dispatch step and still does not learn from the reward signal that finished tasks provide. There is no obvious place for the system to improve its own decomposition strategy from experience, so every gain in routing quality requires another round of human rule editing.

Forces

Routing decisions are task-dependent and the right worker for a sub-task is not knowable from static rules alone.
Frontier models are expensive to use as the always-on orchestrator on every dispatch step.
The worker pool changes — new models arrive, old ones are deprecated — and hand-coded routing must be rewritten each time.
Reward signal from task outcomes is available but unused by static orchestration.
Some sub-tasks are themselves decomposable, so the orchestrator must be able to recurse without infinite expansion.

Example

A product routes user tasks across four frontier models plus a code-execution tool. The team replaces its rule-based router with a 7B conductor trained on six months of task outcomes. The conductor learns that long-context summarisation goes to one vendor, code synthesis to another, image understanding to a third, and that some research tasks should be broken into three sub-tasks where the conductor recursively calls itself as the second-level planner. Average cost-per-task drops, and routing improves without anyone editing rules.
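How routing can improve from task outcomes alone, with nobody editing rules, can be illustrated with a toy sketch. It is not the 7B conductor from the example: it is a tabular softmax policy updated with a plain REINFORCE rule, chosen here only for illustration, since the pattern does not name a specific RL algorithm. The worker ids, task kinds, and the simulated 0/1 reward table are all hypothetical.

```python
import math
import random

WORKERS = ["vendor_a", "vendor_b", "vendor_c"]  # hypothetical worker ids
# Hypothetical ground truth, used only to simulate end-of-task rewards.
BEST_WORKER = {"summarise": "vendor_a", "code": "vendor_b", "image": "vendor_c"}


def softmax(logits):
    """Turn raw scores into a probability distribution over workers."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_router(steps=600, lr=0.5, seed=0):
    """Learn one logit per (task kind, worker) from simulated task rewards."""
    rng = random.Random(seed)
    kinds = sorted(BEST_WORKER)
    logits = {kind: [0.0] * len(WORKERS) for kind in kinds}
    for _ in range(steps):
        kind = rng.choice(kinds)
        probs = softmax(logits[kind])
        choice = rng.choices(range(len(WORKERS)), weights=probs)[0]
        # Simulated end-of-task reward: 1 if the routed worker suited the task.
        reward = 1.0 if WORKERS[choice] == BEST_WORKER[kind] else 0.0
        # REINFORCE: d log pi(choice) / d logit_i = 1[i == choice] - probs[i]
        for i in range(len(WORKERS)):
            grad = (1.0 if i == choice else 0.0) - probs[i]
            logits[kind][i] += lr * reward * grad
    return logits


def greedy_route(logits, kind):
    """Pick the worker with the highest learned score for this task kind."""
    scores = logits[kind]
    return WORKERS[scores.index(max(scores))]


if __name__ == "__main__":
    learned = train_router()
    for kind in sorted(BEST_WORKER):
        print(kind, "->", greedy_route(learned, kind))
```

A real conductor would condition on the task text rather than a fixed task kind, and its reward would come from finished tasks rather than a lookup table. The sketch only shows the direction of the update: choices followed by reward become more likely.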

Diagram

sequenceDiagram
    participant U as User task
    participant C as Conductor (small RL meta-model)
    participant W as Worker pool (frontier LLMs, tools, conductor-as-worker)
    participant R as Reward signal
    U->>C: task
    loop until done or budget
        C->>W: sub-task instruction + worker id
        W-->>C: worker output
        C->>C: decide next move (continue, switch worker, recurse, stop)
    end
    C-->>U: final answer
    R-->>C: end-of-task reward (updates conductor policy only)

Solution

Therefore:

A small conductor model (often in the 7B–13B range) sits in front of a pool of worker LLMs and tools. On each step the conductor emits a natural-language sub-task instruction and a worker selection; the worker is run, its output returned, and the conductor decides the next move. The conductor is trained with reinforcement learning against final task rewards: it learns which workers handle which sub-task shapes, how to phrase the hand-off, when to stop, and when to recursively dispatch a sub-task back to itself as a worker. Recursion is bounded by a depth limit and a step budget. Workers remain frozen frontier models; only the conductor is trained.
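The step loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the trained conductor is stood in for by a plain `policy` callable, and `Action`, `Budget`, `run_task`, the `"conductor"` worker id, and the scripted policy are hypothetical names introduced here. The sketch shows one step budget shared across all recursion levels and a depth cap on self-calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

CONDUCTOR = "conductor"  # hypothetical id of the conductor-as-worker entry


@dataclass
class Action:
    kind: str              # "dispatch" or "stop"
    worker: str = ""       # worker id, used when kind == "dispatch"
    instruction: str = ""  # natural-language sub-task for the worker
    answer: str = ""       # final answer, used when kind == "stop"


@dataclass
class Budget:
    steps_left: int  # shared by every recursion level of one task


History = List[Tuple[str, str]]  # (worker id, worker output) pairs
Policy = Callable[[str, History, int], Action]


def run_task(task: str, policy: Policy, workers: Dict[str, Callable[[str], str]],
             budget: Budget, depth: int = 0, max_depth: int = 2) -> str:
    history: History = []
    while budget.steps_left > 0:
        action = policy(task, history, depth)
        if action.kind == "stop":
            return action.answer
        budget.steps_left -= 1  # every dispatch costs one step, refused or not
        if action.worker == CONDUCTOR:
            if depth >= max_depth:
                output = "refused: recursion depth cap reached"
            else:
                # The conductor calls itself as a worker, one level deeper.
                output = run_task(action.instruction, policy, workers,
                                  budget, depth + 1, max_depth)
        else:
            output = workers[action.worker](action.instruction)
        history.append((action.worker, output))
    # Budget exhausted: fall back to the last worker output, if any.
    return history[-1][1] if history else ""


def scripted_policy(task: str, history: History, depth: int) -> Action:
    """Stand-in for the trained model: recurse once, then use an echo worker."""
    if history:
        answer = history[-1][1]
        return Action("stop", answer=answer.upper() if depth > 0 else answer)
    if depth == 0:
        return Action("dispatch", CONDUCTOR, "sub:" + task)
    return Action("dispatch", "echo", task)


if __name__ == "__main__":
    budget = Budget(steps_left=5)
    print(run_task("hi", scripted_policy, {"echo": lambda s: s}, budget))
    print(budget.steps_left)
```

Sharing one `Budget` object across levels is a design choice in this sketch: a per-level budget would let a recursive task spend steps in proportion to its depth, whereas a shared one keeps the total cost of a task bounded.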

What this pattern forbids. The conductor must respect a hard recursion-depth cap and a step budget on every task, must emit explicit sub-task instructions and worker selections rather than free-form thoughts, and must not invoke workers outside the registered pool — including its own untrained ancestor models.
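These prohibitions can be enforced outside the model, so that a badly behaved policy cannot bypass them. The guard below is a sketch under assumptions: `Limits`, `ForbiddenAction`, `check_dispatch`, the default limit values, and the example worker ids are hypothetical, and a real system might repair or re-prompt instead of raising.

```python
from dataclasses import dataclass
from typing import FrozenSet


class ForbiddenAction(Exception):
    """Raised when a proposed dispatch breaks one of the pattern's rules."""


@dataclass(frozen=True)
class Limits:
    max_depth: int = 2  # assumed value; the pattern only requires a hard cap
    max_steps: int = 8  # assumed value; the pattern only requires a budget


def check_dispatch(worker: str, instruction: str, depth: int, steps_used: int,
                   registered: FrozenSet[str], limits: Limits,
                   conductor_id: str = "conductor") -> None:
    """Validate one proposed dispatch before any worker is run."""
    if worker not in registered:
        # Covers unregistered models, e.g. an older conductor checkpoint.
        raise ForbiddenAction(f"worker {worker!r} is not in the registered pool")
    if not instruction.strip():
        raise ForbiddenAction("dispatch needs an explicit sub-task instruction")
    if steps_used >= limits.max_steps:
        raise ForbiddenAction("step budget exhausted")
    if worker == conductor_id and depth >= limits.max_depth:
        raise ForbiddenAction("recursion depth cap reached")


if __name__ == "__main__":
    pool = frozenset({"conductor", "coder"})
    check_dispatch("coder", "write a parser", 0, 0, pool, Limits())  # passes
    try:
        check_dispatch("conductor-v0", "plan the work", 0, 0, pool, Limits())
    except ForbiddenAction as err:
        print("rejected:", err)
```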

And the patterns that stand alongside it, or against it —

alternative-to Multi-Model Routing ★★: Send each request to the cheapest model that can handle it well.
alternative-to Mixture of Experts Routing ★: Route each request to one or more domain-expert agents, where each expert holds deep capability in a narrow area.
complements Agent-as-Tool Embedding ★: Wrap a sub-agent (with its own loop, prompt, and tool palette) behind a single function-shaped tool signature, so the parent agent calls it like any other tool and never sees the sub-agent's internal turns.

Provenance

Source: patterns/rl-conductor-orchestrator.md on GitHub · commit 4314cd3
Added to catalog: 2026-05-19
Last updated: 2026-05-21
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.