Bayesian Bandit Experimentation

also known as Multi-Armed Bandit for Prompt Variants, Bandit-Based Agent Rollout

Replace fixed-split A/B tests between agent variants with a bandit that dynamically reallocates traffic toward better-performing variants based on observed reward, bounding regret from bad variants.

This pattern helps complete certain larger patterns —

specialises Exploration vs Exploitation ★: Balance taking the best-known action (exploit) with trying alternatives that might be better (explore).

Context

An agent team has multiple variants in play: two prompt templates, three model choices, two retrieval strategies. They want to learn which performs best on production traffic without exposing many users to the worse variants for the full length of a classical A/B test.

Problem

A fixed 50/50 (or N-way uniform) split between variants pays regret on every losing variant for the entire experiment window, and the regret grows with each additional losing variant in the test. Worse, a fixed-horizon test cannot be stopped early without invalidating its statistics, so teams leave losing variants in production for weeks because the rollout calendar said so. When the team genuinely cares about user outcomes during the experiment, a static split is the wrong learning policy.
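The cost of a uniform split can be put in numbers. The sketch below is illustrative only: it assumes each variant has a fixed, known success rate (in practice these are unknown, which is the point of the experiment), and the function name and rates are hypothetical.

```python
def uniform_split_regret(success_rates, n_requests):
    """Expected successes forgone by an even split, compared with
    sending every request to the best variant."""
    best = max(success_rates)
    average = sum(success_rates) / len(success_rates)
    return n_requests * (best - average)


# Hypothetical success rates for three variants over 30,000 requests.
rates = [0.62, 0.55, 0.48]
print(uniform_split_regret(rates, 30_000))
```

With these made-up rates the even split forgoes roughly 2,100 successful interactions over the window. The regret of an even split grows linearly with the length of the experiment.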

Forces

Some variants are clearly worse early; continuing uniform allocation pays regret.
Some variants need many trials to reveal their advantage; aggressive exploitation kills them.
Reward signals (task success, user satisfaction, cost) arrive with delay and noise.
Operators need to be able to read off 'which variant is winning' at any point.

Example

A support-agent team has four candidate prompt templates and two candidate models. They run all eight (template × model) combinations as bandit arms, using Thompson sampling with downstream user ratings as the reward. By day three, two arms have accumulated enough evidence to be candidates for promotion; the bandit allocates more than 70% of traffic to them and continues exploring the rest at a low rate.
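One way to judge whether an arm has "enough evidence" is to estimate the posterior probability that it is the best arm. The sketch below makes several assumptions not stated in the example: user ratings are reduced to a binary positive/negative outcome, each arm starts from a uniform Beta(1, 1) prior, and the template and model names are placeholders.

```python
import itertools
import random

templates = ["t1", "t2", "t3", "t4"]  # hypothetical template names
models = ["m1", "m2"]                 # hypothetical model names
arms = list(itertools.product(templates, models))  # 8 (template, model) arms


def probability_best(counts, n_draws=10_000, seed=0):
    """Monte Carlo estimate of P(arm has the highest success rate).

    counts maps arm -> (successes, failures); a Beta(1, 1) prior is assumed.
    """
    rng = random.Random(seed)
    wins = dict.fromkeys(counts, 0)
    for _ in range(n_draws):
        # One joint draw from every arm's Beta posterior.
        draws = {
            arm: rng.betavariate(1 + s, 1 + f)
            for arm, (s, f) in counts.items()
        }
        wins[max(draws, key=draws.get)] += 1
    return {arm: w / n_draws for arm, w in wins.items()}
```

Calling `probability_best` on the current per-arm counts gives one number per arm that operators can read directly. A promotion rule such as "promote when one arm exceeds 0.95" is a policy choice for the team, not something the example prescribes.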

Diagram

flowchart TD
    Req[Incoming request] --> Pol[Bandit policy]
    Pol --> V1[Variant A]
    Pol --> V2[Variant B]
    Pol --> V3[Variant C]
    V1 --> R[Observe reward]
    V2 --> R
    V3 --> R
    R --> Upd[Update posteriors]
    Upd --> Pol

Solution

Therefore:

Treat each variant as a bandit arm. After each request, record the variant chosen and (when it arrives) the reward (task success, satisfaction, cost). A Thompson sampler or upper-confidence-bound policy decides allocation for the next request. Run for a budget of requests or until posterior separation crosses a threshold; promote the winner. Surface posterior means and credible intervals in the experiment dashboard.
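The loop described above can be sketched as a Beta-Bernoulli Thompson sampler. This is a minimal illustration, not a production implementation: it assumes a binary reward (success or failure), a uniform Beta(1, 1) prior per arm, and an in-memory map of pending requests so that late-arriving rewards can be credited to the variant that served them. The class and method names are hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Arm:
    successes: int = 0
    failures: int = 0


@dataclass
class ThompsonBandit:
    arms: dict = field(default_factory=dict)     # variant name -> Arm
    pending: dict = field(default_factory=dict)  # request id -> variant name
    rng: random.Random = field(default_factory=random.Random)

    def choose(self, request_id):
        # Draw one sample per arm from its Beta(1 + s, 1 + f) posterior
        # and serve the arm with the largest draw.
        draws = {
            name: self.rng.betavariate(1 + arm.successes, 1 + arm.failures)
            for name, arm in self.arms.items()
        }
        variant = max(draws, key=draws.get)
        self.pending[request_id] = variant
        return variant

    def record(self, request_id, success):
        # Rewards may arrive late; look up which variant served the request.
        variant = self.pending.pop(request_id, None)
        if variant is None:
            return  # unknown or already-recorded request
        arm = self.arms[variant]
        if success:
            arm.successes += 1
        else:
            arm.failures += 1


# Usage: choose a variant for a request, then record its reward later.
bandit = ThompsonBandit(arms={"A": Arm(), "B": Arm(), "C": Arm()})
variant = bandit.choose("req-1")
bandit.record("req-1", success=True)
```

Sampling from the posterior rather than always picking the highest mean is what keeps under-explored arms alive: an arm with few trials has a wide posterior and will occasionally produce the largest draw. Non-binary rewards such as cost would need a different likelihood than the Beta-Bernoulli model assumed here.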

What this pattern forbids. Variant allocation must not be a fixed-fraction split when reward can be observed online; the policy must update from observed reward and shift allocation accordingly.
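The solution names an upper-confidence-bound policy as an alternative to Thompson sampling; either one satisfies the requirement that allocation follows observed reward. A minimal sketch of the standard UCB1 rule is shown here, assuming rewards are scaled to the range [0, 1]. The function name and argument shapes are hypothetical.

```python
import math


def ucb1_choose(counts, reward_sums):
    """Pick an arm by UCB1.

    counts[a] is the number of times arm a was served;
    reward_sums[a] is its total reward, with each reward in [0, 1].
    """
    # Serve every arm once before applying the confidence bound.
    for arm, n in counts.items():
        if n == 0:
            return arm
    total = sum(counts.values())

    def score(arm):
        mean = reward_sums[arm] / counts[arm]
        bonus = math.sqrt(2 * math.log(total) / counts[arm])
        return mean + bonus

    return max(counts, key=score)
```

Unlike Thompson sampling, UCB1 is deterministic given the counts: the exploration bonus shrinks as an arm is served more often, so rarely-served arms are revisited until their upper bound falls below the leader's.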

The smaller patterns that complete this one —

uses Eval Harness ★★: Run a held-out dataset against agent versions to detect regressions and measure improvement.

And the patterns that stand alongside it, or against it —

alternative-to Shadow Canary ★★: Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.
complements Evaluator-Optimizer ★★: One LLM generates; another evaluates and feeds back; loop until criteria are met.
complements [evaluation-driven-development]
composes-with Prompt Variant Evaluation ★★: Author multiple variants of the same prompt node, run them as a batch against a shared dataset, and let an automated evaluation flow score them so the winning variant is selected by measurement.
alternative-to Trust and Reputation Routing ★: Maintain a per-agent reputation score updated from outcome quality and peer feedback, and route new tasks preferentially to high-reputation agents.

Used in frameworks

Statsig
first-class · 3 patterns · Enterprise Platforms ★★ · mature
Statsig Autotune is a multi-armed bandit that uses Thompson sampling to dynamically reallocate traffic toward better-performing variants in real time, minimizing regret rather tha…

References

Building Applications with AI Agents
book

Provenance

Source: patterns/bayesian-bandit-experimentation.md on GitHub · commit 135ae3c
Added to catalog: 2026-05-23
Last updated: 2026-05-23
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.