X · Governance & Observability · Emerging

Bayesian Bandit Experimentation

also known as Multi-Armed Bandit for Prompt Variants, Bandit-Based Agent Rollout

Replace fixed-split A/B tests between agent variants with a bandit that dynamically reallocates traffic toward better-performing variants based on observed reward, bounding regret from bad variants.

Context

An agent team has multiple variants in play: two prompt templates, three model choices, two retrieval strategies. They want to learn which performs best on production traffic without exposing many users to the worse variants for the full length of a classical A/B test.

Problem

A fixed 50/50 (or N-way uniform) split between variants pays regret on every losing variant for the entire experiment window. With several variants running at once, most traffic goes to arms that will not win, and the regret adds up. Worse, a fixed-horizon test cannot be stopped early without invalidating its statistics, so teams keep the losing variants live for weeks because the rollout calendar said so. When the team genuinely cares about user outcomes during the experiment, a static split is the wrong learning policy.
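To make the cost concrete, the following minimal sketch computes the expected regret of an even split, measured in lost successes relative to always serving the best variant. The success rates and request count are hypothetical, chosen only for illustration.

```python
def uniform_split_regret(success_rates, n_requests):
    """Expected regret (lost successes) of an even split across all arms."""
    best = max(success_rates)
    per_arm = n_requests / len(success_rates)  # each arm gets an equal share
    # Every request sent to a worse arm loses (best - p) successes in expectation.
    return sum((best - p) * per_arm for p in success_rates)


# Hypothetical true success rates for four variants over 40,000 requests.
rates = [0.62, 0.58, 0.50, 0.45]
print(uniform_split_regret(rates, 40_000))
```

With these assumed numbers the even split gives up roughly 3,300 successes over the window, and the figure scales linearly with the length of the experiment.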

Forces

  • Some variants are clearly worse early; continuing uniform allocation pays regret.
  • Some variants need many trials to reveal their advantage; aggressive exploitation kills them.
  • Reward signals (task success, user satisfaction, cost) arrive with delay and noise.
  • Operators need to be able to read off 'which variant is winning' at any point.
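The delayed-reward force can be handled by keeping a ledger of which arm served each request and joining the reward to it when it arrives. The sketch below is one possible shape, not a prescribed design; `RewardLedger`, its method names, and the one-day timeout are assumptions made for illustration.

```python
import time


class RewardLedger:
    """Joins delayed rewards back to the arm that served each request."""

    def __init__(self, timeout_s=86_400):
        self.timeout_s = timeout_s
        self.pending = {}  # request_id -> (arm, assigned_at)

    def record_assignment(self, request_id, arm, now=None):
        """Remember which arm served this request and when."""
        assigned_at = time.time() if now is None else now
        self.pending[request_id] = (arm, assigned_at)

    def resolve(self, request_id, reward):
        """Return (arm, reward) for a pending request, or None if unknown or expired."""
        entry = self.pending.pop(request_id, None)
        if entry is None:
            return None
        return entry[0], reward

    def expire(self, now=None):
        """Drop assignments whose reward never arrived; return how many were dropped."""
        now = time.time() if now is None else now
        stale = [rid for rid, (_, t) in self.pending.items()
                 if now - t > self.timeout_s]
        for rid in stale:
            del self.pending[rid]
        return len(stale)
```

Whether an expired assignment is dropped, as here, or counted as a failure is a policy choice that changes what the reward measures, so it is worth deciding explicitly.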

Example

A support-agent team has four candidate prompt templates and two candidate models. They run all eight (template × model) combinations as bandit arms, using Thompson sampling with downstream user rating as the reward. By day three, two arms have accumulated enough evidence to be credible candidates for promotion; the bandit allocates more than 70% of traffic to them and continues exploring the rest at a low rate.
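As a rough illustration of the setup in this example, the sketch below enumerates the eight arms and computes the share of traffic served by the most-used arms from an assignment log. The template and model names and the `traffic_share` helper are placeholders, not part of the original example.

```python
from collections import Counter
from itertools import product

TEMPLATES = ["t1", "t2", "t3", "t4"]  # hypothetical template names
MODELS = ["model-a", "model-b"]       # hypothetical model names

# One arm per (template, model) combination: 4 x 2 = 8 arms.
ARMS = list(product(TEMPLATES, MODELS))


def traffic_share(assignments, top_k=2):
    """Fraction of logged requests served by the top_k most-used arms."""
    if not assignments:
        return 0.0
    counts = Counter(assignments)
    top = sum(n for _, n in counts.most_common(top_k))
    return top / len(assignments)
```

Applied to a rolling window of recent assignments, a value above 0.7 for `top_k=2` would correspond to the situation described on day three.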

Solution

Therefore:

Treat each variant as a bandit arm. After each request, record the variant chosen and (when it arrives) the reward (task success, satisfaction, cost). A Thompson sampler or upper-confidence-bound policy decides allocation for the next request. Run for a budget of requests or until posterior separation crosses a threshold; promote the winner. Surface posterior means and credible intervals in the experiment dashboard.
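The loop described above can be sketched with a Beta-Bernoulli Thompson sampler. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes binary reward (1 for success, 0 for failure), a uniform Beta(1, 1) prior, and a 0.95 probability-of-being-best threshold as the stopping rule. The class name `BetaThompson`, the simulated success rates, and the check interval are all hypothetical.

```python
import random


class BetaThompson:
    """Thompson sampling over arms with binary reward and Beta posteriors."""

    def __init__(self, arms, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per arm, stored as [successes + 1, failures + 1].
        self.params = {arm: [1.0, 1.0] for arm in arms}

    def choose(self):
        """Draw one plausible success rate per arm and serve the highest draw."""
        draws = {a: self.rng.betavariate(s, f) for a, (s, f) in self.params.items()}
        return max(draws, key=draws.get)

    def update(self, arm, reward):
        """Fold one observed reward (truthy = success) into the arm's posterior."""
        self.params[arm][0 if reward else 1] += 1.0

    def prob_best(self, n_draws=2000):
        """Monte Carlo estimate of P(arm has the highest true rate), per arm."""
        wins = dict.fromkeys(self.params, 0)
        for _ in range(n_draws):
            wins[self.choose()] += 1
        return {a: w / n_draws for a, w in wins.items()}

    def summary(self, n_draws=2000):
        """Posterior mean and approximate 95% credible interval per arm."""
        out = {}
        for a, (s, f) in self.params.items():
            xs = sorted(self.rng.betavariate(s, f) for _ in range(n_draws))
            lo, hi = xs[int(0.025 * n_draws)], xs[int(0.975 * n_draws)]
            out[a] = (s / (s + f), lo, hi)
        return out


# Simulated traffic; these true rates are hypothetical and hidden from the sampler.
true_rates = {"A": 0.30, "B": 0.50, "C": 0.70}
bandit = BetaThompson(true_rates, seed=0)
sim = random.Random(1)

for i in range(1, 5001):                      # request budget
    arm = bandit.choose()
    bandit.update(arm, sim.random() < true_rates[arm])
    # Check posterior separation periodically rather than on every request.
    if i % 500 == 0 and max(bandit.prob_best().values()) > 0.95:
        break

for arm, (mean, lo, hi) in bandit.summary().items():
    print(f"{arm}: mean={mean:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

The `summary` output is the kind of table a dashboard could surface. An upper-confidence-bound policy would replace `choose` with a deterministic rule, and non-binary rewards such as cost would need a different posterior model.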

What this pattern forbids. Variant allocation must not be a fixed-fraction split when reward can be observed online; the policy must update from observed reward and shift allocation accordingly.

The smaller patterns that complete this one —

  • uses: Eval Harness ★★. Run a held-out dataset against agent versions to detect regressions and measure improvement.

And the patterns that stand alongside it, or against it —

  • alternative-to: Shadow Canary ★★. Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.
  • complements: Evaluator-Optimizer ★★. One LLM generates; another evaluates and feeds back; loop until criteria are met.
  • complements: [evaluation-driven-development]
  • composes-with: Prompt Variant Evaluation ★★. Author multiple variants of the same prompt node, run them as a batch against a shared dataset, and let an automated evaluation flow score them so the winning variant is selected by measurement.
  • alternative-to: Trust and Reputation Routing. Maintain a per-agent reputation score updated from outcome quality and peer feedback, and route new tasks preferentially to high-reputation agents.
