Exploration vs Exploitation
also known as Exploration & Discovery, Curiosity-Driven Action
Balance taking the best-known action (exploit) with trying alternatives that might be better (explore).
Context
A team runs a long-lived agent that repeatedly chooses among a set of options — which tool to call, which prompt template to use, which strategy to try — and can observe an outcome signal after each choice (success, reward, user thumbs-up). Over time the agent should get better at the choice, not just freeze the first decent option in place. This is the classical multi-armed-bandit setting applied to agent decision points.
Problem
An agent that always picks whatever is currently the best-known option (pure exploitation) locks in at whatever local optimum it stumbled into early and never discovers that a different tool or template would have worked better. An agent that always tries something new (pure exploration) burns budget on unproven options and never compounds what it has already learned. Picking the trade-off informally — by gut feel or by occasional manual override — gives neither the predictable improvement of a scheduled policy nor the statistical guarantees that bandit theory provides.
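The guarantees referred to above are usually stated in terms of regret. As a reference point, the standard textbook definition for a stochastic bandit with K options is given below. The notation (μ_a, a_t, T) is introduced here for illustration and is not part of the pattern itself.

```latex
R_T \;=\; T\,\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right],
\qquad
\mu^{*} \;=\; \max_{a \in \{1,\dots,K\}} \mu_a
```

Here μ_a is the expected reward of option a, a_t is the option chosen at step t, and T is the number of decisions. A policy that never explores can incur regret that grows linearly in T if it locks onto a suboptimal option, whereas UCB-style policies are known to keep regret logarithmic in T under the usual assumptions (bounded rewards, stationary reward distributions).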
Forces
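- Exploration has a real cost: every failed attempt spends budget and may degrade a user-facing outcome.
- A reward signal must exist; without observed outcomes there is nothing to shape the trade-off.
- The exploration schedule (epsilon-greedy, UCB, Thompson sampling) is a design decision in its own right.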
Example
An agent that recommends customer-support replies has a strong default template that wins most of the time, so it is used for 100% of replies. New phrasings that might be better are never tried, and the system silently sits at a local optimum. The team applies Exploration vs Exploitation: 90% of replies use the current best template (exploit) and 10% sample from candidate variants (explore), with the outcome of every reply tracked per template. Within weeks the system surfaces a variant that outperforms the previous best, and that variant becomes the new exploit choice.
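The 90/10 split in this example can be sketched as an epsilon-greedy policy. This is a minimal illustration, not the team's implementation: the class name `EpsilonGreedy`, the template names, and the choice to treat reward as a number in [0, 1] are all assumptions made for the sketch.

```python
import random


class EpsilonGreedy:
    """Epsilon-greedy choice over named options (hypothetical helper)."""

    def __init__(self, options, epsilon=0.1, rng=None):
        self.options = list(options)
        self.epsilon = epsilon                  # share of decisions spent exploring
        self.rng = rng or random.Random()
        self.counts = {o: 0 for o in self.options}
        self.means = {o: 0.0 for o in self.options}

    def best(self):
        # Current best-known option by observed mean reward (ties: first listed).
        return max(self.options, key=lambda o: self.means[o])

    def choose(self):
        best = self.best()
        candidates = [o for o in self.options if o != best]
        if candidates and self.rng.random() < self.epsilon:
            return self.rng.choice(candidates)  # explore: a non-best variant
        return best                             # exploit: the current best

    def record(self, option, reward):
        # Incremental running mean, so no reward history needs to be stored.
        self.counts[option] += 1
        self.means[option] += (reward - self.means[option]) / self.counts[option]


# Usage sketch: template names are placeholders, reward is 1.0 for a thumbs-up.
policy = EpsilonGreedy(["default", "variant_a", "variant_b"], epsilon=0.1)
template = policy.choose()
policy.record(template, 1.0)
```

Exploring only among the non-best options mirrors the "10% sample from candidate variants" wording. The more common textbook variant samples uniformly from all options, including the current best, which makes the effective exploration rate slightly lower than ε.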
Solution
Therefore:
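Pick a strategy: epsilon-greedy (explore with probability ε, exploit with probability 1-ε), upper confidence bound (UCB: add a bonus to options that have been tried less often, then pick the highest score), or Thompson sampling (draw a value for each option from its posterior and pick the highest draw). Apply it at every decision point where options compete: tools, strategies, prompts. Track the outcome of each choice and update the estimates the strategy relies on.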
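The two strategies that go beyond a fixed ε can be sketched briefly. The code below assumes the standard UCB1 bonus and a Beta-Bernoulli model with a uniform Beta(1, 1) prior for Thompson sampling. Both are common textbook choices rather than something this pattern prescribes, and the function names and dictionary-based bookkeeping are hypothetical.

```python
import math
import random


def ucb1_choose(counts, means, c=2.0):
    """UCB1: pick the option with the highest mean plus exploration bonus.

    counts[o] is how often option o was tried, means[o] its mean reward in [0, 1].
    """
    # Try every option once before any bonus can be computed.
    for option, n in counts.items():
        if n == 0:
            return option
    total = sum(counts.values())
    # The bonus shrinks as an option accumulates trials.
    return max(
        counts,
        key=lambda o: means[o] + math.sqrt(c * math.log(total) / counts[o]),
    )


def thompson_choose(successes, failures, rng=random):
    """Thompson sampling for success/failure rewards with a Beta(1, 1) prior."""
    # Draw one plausible success rate per option from its posterior.
    draws = {
        o: rng.betavariate(successes[o] + 1, failures[o] + 1)
        for o in successes
    }
    # Act greedily with respect to the sampled rates.
    return max(draws, key=draws.get)
```

UCB1 is deterministic given the counts, which makes its behaviour easy to audit. Thompson sampling is randomised, so it tends to spread traffic more smoothly across options whose estimates are still uncertain.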
What this pattern forbids. The agent's action distribution must follow the chosen strategy; unconditional exploitation is forbidden.
The smaller patterns that complete this one —
- generalises Bayesian Bandit Experimentation ★: Replace fixed-split A/B tests between agent variants with a bandit that dynamically reallocates traffic toward better-performing variants based on observed reward, bounding regret from bad variants.
And the patterns that stand alongside it, or against it —
- complements Language Agent Tree Search ·: Lift the agent loop into a search tree with a learned value function and backtracking.
- complements Skill Library ★: Let the agent grow its own toolkit by writing reusable skills that subsequent runs can call.
- complements Soft-Optimization Cap ·: Cap how strongly the agent optimises its inferred objective by sampling from the top quantile of acceptable actions rather than taking the argmax, or by stopping once the objective is good enough.