Orchestrator as Bottleneck
also known as Single-Process Scheduler Bottleneck, Centralized Orchestrator Cap
Anti-pattern: route all agent runs through a single-process orchestrator that becomes the system-wide concurrency ceiling.
Context
A team adopts a workflow engine or supervisor pattern early and runs it as a single process. Workers scale horizontally, but the orchestrator is one box managing state, dispatching events, and tracking run progress.
Problem
The orchestrator becomes the load-bearing single point of contention. The practical scaling ceiling sits around 10–100 concurrent workflows, depending on how chatty the orchestrator is. Adding workers does not help; they queue waiting for orchestrator decisions. The fix is structural (a sharded orchestrator, event-driven dispatch, or a stateless reducer per workflow) and expensive to retrofit once business logic depends on the centralized view.
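A back-of-the-envelope model shows why adding workers stops helping. The sketch below assumes, purely for illustration, that every worker step needs exactly one orchestrator decision and that a single-threaded orchestrator handles decisions one at a time. The function name and all numbers are hypothetical, not measurements.

```python
def steps_per_second(workers: int, worker_step_s: float, dispatch_cost_s: float) -> float:
    """Upper bound on throughput with one single-threaded orchestrator.

    Assumption: each worker step requires exactly one orchestrator decision,
    and decisions are processed strictly one at a time.
    """
    worker_capacity = workers / worker_step_s       # what the worker pool could do
    orchestrator_capacity = 1.0 / dispatch_cost_s   # what one orchestrator can dispatch
    return min(worker_capacity, orchestrator_capacity)


if __name__ == "__main__":
    # Hypothetical numbers: 2 s per worker step, 10 ms per orchestrator decision.
    for workers in (10, 100, 200, 400, 800):
        rate = steps_per_second(workers, worker_step_s=2.0, dispatch_cost_s=0.010)
        print(f"{workers} workers -> at most {rate:.1f} steps/s")
```

Under these assumptions throughput grows with the worker count only until it reaches 1 / dispatch_cost_s; beyond that point extra workers change nothing, which is the queueing behaviour described above.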
Forces
- Centralized orchestrators are dramatically easier to reason about, debug, and visualize.
- Sharding orchestration breaks naive global views (cross-workflow queries become expensive).
- The bottleneck only shows up at scale, after the architecture is hard to change.
Example
A team builds a multi-agent system on a single Python supervisor process. It works fine at 30 concurrent workflows. At 200 concurrent workflows the supervisor pegs its CPU dispatching events while workers idle waiting for assignments. Adding workers does nothing. The fix is sharding the supervisor by tenant id, which requires rewriting all cross-tenant analytics queries that assumed a single in-memory view.
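Sharding by tenant id can be as simple as a stable hash from tenant to supervisor shard. The following is a minimal sketch, not the team's actual routing: `shard_for` and the tenant names are hypothetical, and the shard count of 4 is an arbitrary assumption.

```python
import hashlib


def shard_for(tenant_id: str, num_shards: int) -> int:
    """Map a tenant id to a supervisor shard with a stable hash.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and could route the same tenant to different
    shards on different replicas.
    """
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    # Hypothetical tenants; every replica computes the same assignment.
    for tenant in ("acme", "globex", "initech"):
        print(tenant, "-> supervisor shard", shard_for(tenant, num_shards=4))
```

Plain modulo hashing reassigns most tenants whenever the shard count changes; a lookup table or consistent hashing are common alternatives when shards are expected to be added over time.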
Solution
Therefore:
Partition orchestrator state by run id, tenant, or workflow type. Use a durable event store or durable-execution backend (Kafka, Temporal, Postgres logical replication) so multiple orchestrator replicas can subscribe independently. Where a single global view is needed, build it as a materialized projection of the event log, not as the orchestrator's local state. Pair with Stateless Reducer Agent so each workflow can be rehydrated on any replica.
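The two ideas in this solution, rehydrating a run from its events and deriving the global view from the log, can be sketched together. This is an illustration under stated assumptions: an in-memory list stands in for the durable event store, and the `Event` fields, event kinds, and function names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, replace
from typing import Iterable


@dataclass(frozen=True)
class Event:
    run_id: str
    tenant: str
    kind: str  # hypothetical kinds: "started", "step_done", "finished"


@dataclass(frozen=True)
class RunState:
    status: str = "new"
    steps_done: int = 0


def reduce_run(state: RunState, event: Event) -> RunState:
    """Pure reducer: (state, event) -> new state, with no I/O or hidden state."""
    if event.kind == "started":
        return replace(state, status="running")
    if event.kind == "step_done":
        return replace(state, steps_done=state.steps_done + 1)
    if event.kind == "finished":
        return replace(state, status="finished")
    return state  # unknown event kinds are ignored


def rehydrate(run_id: str, log: Iterable[Event]) -> RunState:
    """Rebuild one run's state on any replica by replaying its events."""
    state = RunState()
    for event in log:
        if event.run_id == run_id:
            state = reduce_run(state, event)
    return state


def running_per_tenant(log: Iterable[Event]) -> Counter:
    """Global view built as a projection of the log, not orchestrator memory."""
    running = Counter()
    for event in log:
        if event.kind == "started":
            running[event.tenant] += 1
        elif event.kind == "finished":
            running[event.tenant] -= 1
    return +running  # unary plus keeps only tenants with a positive count


if __name__ == "__main__":
    # In-memory stand-in for a durable, ordered event log.
    log = [
        Event("r1", "acme", "started"),
        Event("r2", "globex", "started"),
        Event("r1", "acme", "step_done"),
        Event("r2", "globex", "finished"),
    ]
    print(rehydrate("r1", log))
    print(running_per_tenant(log))
```

Because neither function depends on which process runs it, any replica can pick up any run, and the cross-tenant view survives resharding. In a real system the projection would be updated incrementally as events arrive instead of replaying the full log on every query.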
What this pattern forbids. Nothing useful: as an anti-pattern it imposes no helpful constraint. The constraint it is missing is horizontally partitionable orchestration from day one.
And the patterns that stand alongside it, or against it —
- complements Stateless Reducer Agent ★: Design the agent as a pure function (state, event) → newState; entire execution history is held in an external event log; enables pause / resume / replay / time-travel without bespoke checkpointing.
- alternative-to Event-Driven Agent ★★: Trigger the agent on external events (webhooks, message queues, file changes) instead of user requests or schedules.
- complements Durable Workflow Snapshot ★: Capture workflow execution state as a snapshot in a pluggable storage provider so a paused run can resume across deployments, process restarts, and host crashes.
- complements Blocking Sync Calls in Agent Loop ✕: Anti-pattern: run synchronous, blocking I/O inside the agent loop or HTTP handler, capping concurrency at the number of OS threads.
- alternative-to Supervisor ★★: Place a coordinating agent above a set of specialised agents and route work to them.
- complements Infrastructure Burst Bottleneck (Agent Scale-Out) ✕: Anti-pattern: deploy agents whose scale-out behavior triggers sudden data-and-compute bursts that on-prem or under-provisioned cloud infrastructure cannot absorb; agents work at small scale and freeze in production.