Planning & Control Flow

Exploration vs Exploitation

Balance taking the best-known action (exploit) with trying alternatives that might be better (explore).

Problem

An agent that always picks the best-known option (pure exploitation) gets stuck at whatever local optimum it stumbled into early and never discovers that a different tool or template would have worked better. An agent that always tries something new (pure exploration) burns budget on unproven options and never compounds what it has already learned. Picking the trade-off informally, by gut feel or by occasional manual override, gives neither the predictable improvement of a scheduled policy nor the statistical guarantees that bandit theory provides.

Solution

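Pick a selection strategy: epsilon-greedy (exploit the best-known option with probability 1-ε, explore a random option with probability ε), upper confidence bound (add a bonus to under-explored options so they keep getting tried until their estimates firm up), or Thompson sampling (sample each option's value from its posterior and pick the highest draw). Apply it across tools, strategies, and prompts. Track outcomes and adjust the estimates, and the strategy's parameters, as evidence accumulates.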
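
As a rough illustration, epsilon-greedy and UCB selection could be sketched as below. This is a minimal sketch, not a reference implementation: the names `ArmStats`, `epsilon_greedy`, `ucb1`, and `record` are hypothetical, rewards are assumed to lie in [0, 1], and the bonus term is the standard UCB1 form, which is one of several UCB variants.

```python
import math
import random


class ArmStats:
    """Running outcome statistics for one option (a tool, strategy, or prompt)."""

    def __init__(self):
        self.pulls = 0
        self.total_reward = 0.0

    @property
    def mean(self):
        # Options that were never tried report a mean of 0.0.
        return self.total_reward / self.pulls if self.pulls else 0.0


def epsilon_greedy(stats, epsilon, rng):
    """Explore a uniformly random option with probability epsilon, else exploit."""
    names = list(stats)
    if rng.random() < epsilon:
        return rng.choice(names)
    return max(names, key=lambda name: stats[name].mean)


def ucb1(stats):
    """Pick the option with the highest mean plus exploration bonus (UCB1)."""
    for name, arm in stats.items():
        if arm.pulls == 0:
            return name  # try every option once before comparing bounds
    total = sum(arm.pulls for arm in stats.values())
    return max(
        stats,
        key=lambda name: stats[name].mean
        + math.sqrt(2 * math.log(total) / stats[name].pulls),
    )


def record(stats, name, reward):
    """Track the outcome of using an option (reward assumed to be in [0, 1])."""
    stats[name].pulls += 1
    stats[name].total_reward += reward
```

A caller would keep one `ArmStats` per option, call `epsilon_greedy(stats, 0.1, random.Random())` or `ucb1(stats)` before each decision, and pass the observed outcome to `record` afterwards. The value 0.1 for ε is only an example and would need tuning.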

When to use

  • The agent chooses repeatedly among options (tools, strategies, prompts) and outcomes can be tracked.
  • Pure exploitation is locking the agent into local optima.
  • A strategy (epsilon-greedy, UCB, Thompson sampling) can be picked and tuned.
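
When outcomes can be tracked as simple success or failure, Thompson sampling might look like the sketch below. It assumes binary outcomes and a uniform Beta(1, 1) prior per option. The names `BetaArm`, `thompson_select`, and `simulate`, and the success rates passed to the simulation, are hypothetical and only for illustration.

```python
import random


class BetaArm:
    """Beta posterior over one option's success rate (uniform Beta(1, 1) prior)."""

    def __init__(self):
        self.alpha = 1  # 1 + number of observed successes
        self.beta = 1   # 1 + number of observed failures

    def sample(self, rng):
        # Draw a plausible success rate from the current posterior.
        return rng.betavariate(self.alpha, self.beta)

    def update(self, succeeded):
        if succeeded:
            self.alpha += 1
        else:
            self.beta += 1


def thompson_select(arms, rng):
    """Draw one sample per option and pick the option with the largest draw."""
    return max(arms, key=lambda name: arms[name].sample(rng))


def simulate(true_rates, rounds, seed=0):
    """Run the selection loop against made-up success rates and count picks."""
    rng = random.Random(seed)
    arms = {name: BetaArm() for name in true_rates}
    picks = {name: 0 for name in true_rates}
    for _ in range(rounds):
        name = thompson_select(arms, rng)
        picks[name] += 1
        # Simulated outcome; a real agent would observe this from the task.
        arms[name].update(rng.random() < true_rates[name])
    return picks
```

Calling `simulate({"prompt_a": 0.9, "prompt_b": 0.1}, rounds=2000)` returns how often each option was chosen. Options with wide posteriors still win the draw occasionally, so exploration tapers off as evidence accumulates rather than following a fixed ε.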

Related