VII · Verification & Reflection · Mature ★★

Prompt Variant Evaluation

also known as Prompt Flow Variant Compare, Batch-Variant Evaluation

Author multiple variants of the same prompt node, run them as a batch against a shared dataset, and let an automated evaluation flow score them so the winning variant is selected by measurement.

Context

A team is iterating on a prompt: different wordings, different examples, different model bindings. Selecting between variants by demo or by author taste produces non-reproducible decisions, and the basis for comparison is lost as soon as the demo is forgotten.

Problem

Without a batched comparison harness, each prompt edit is a vibe check. Authors converge on what looks good on the two examples they happened to test. Later reviewers cannot tell whether the chosen variant is better than the rejected ones, because the rejected ones were never measured. The team accumulates committed prompts whose superiority over alternatives no one can verify.

Forces

  • Variants must run against the same dataset for comparison to be valid.
  • The eval rubric must be frozen before the variants run, or scoring is post-hoc rationalisation.
  • Multiple variants per slot multiply cost — sensible batch size matters.
  • Winners must be inspectable: per-variant scores, per-item differences.
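The first two forces can be checked mechanically. The sketch below is one possible approach, not part of the pattern's definition: it fingerprints the dataset and rubric together before any variant runs, then refuses to compare runs that carry different fingerprints. The names `fingerprint` and `check_comparable`, the run record shape, and the toy dataset are all hypothetical.

```python
import hashlib
import json


def fingerprint(dataset, rubric):
    """Return a stable SHA-256 hex digest of the dataset and rubric together."""
    # sort_keys makes the digest independent of dict key order.
    payload = json.dumps({"dataset": dataset, "rubric": rubric}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def check_comparable(runs):
    """Raise if the runs were not all produced against one shared fingerprint."""
    seen = {run["fingerprint"] for run in runs}
    if len(seen) != 1:
        raise ValueError(f"runs used {len(seen)} different dataset/rubric versions")
    return seen.pop()


# Toy inputs, invented for illustration only.
dataset = [
    {"id": 1, "input": "Ada, 36, London"},
    {"id": 2, "input": "Bo, 41, Oslo"},
]
rubric = {"criteria": ["name extracted", "age extracted"], "scale": [0, 1]}

# Freeze before any variant runs; each run records the fingerprint it used.
frozen = fingerprint(dataset, rubric)
runs = [
    {"variant": "v1", "fingerprint": frozen},
    {"variant": "v2", "fingerprint": frozen},
]
check_comparable(runs)
```

Editing the rubric after the fact changes the digest, so a run scored against the edited rubric would be rejected rather than silently compared.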

Example

A team has a profile-extraction prompt with five candidate wordings. They run all five against the frozen 100-item eval set using an LLM-judge rubric. Variant 3 wins on overall score but variant 5 wins on a specific edge-case slice; the team picks variant 3 and adds the edge cases to the eval set so future runs measure them.
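The split between an overall winner and a slice winner can be made visible with a small aggregation. The following is a minimal sketch under stated assumptions: the scores are invented toy numbers on a 0 to 10 scale (not results from any real run), the slice names `typical` and `edge` are hypothetical, and `mean_by_slice` and `winner` are illustrative helpers rather than an existing API.

```python
from collections import defaultdict
from statistics import mean


def mean_by_slice(scores):
    """Aggregate (variant, slice, score) rows into {variant: {slice: mean}}.

    An extra "overall" key holds the mean across all of a variant's items.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for variant, slice_name, score in scores:
        buckets[variant][slice_name].append(score)
        buckets[variant]["overall"].append(score)
    return {
        variant: {name: mean(values) for name, values in slices.items()}
        for variant, slices in buckets.items()
    }


def winner(table, slice_name):
    """Return the variant with the highest mean score on the given slice."""
    return max(table, key=lambda variant: table[variant][slice_name])


# Invented toy scores: three typical items and one edge-case item per variant.
scores = [
    ("variant_3", "typical", 9), ("variant_3", "typical", 9),
    ("variant_3", "typical", 9), ("variant_3", "edge", 4),
    ("variant_5", "typical", 7), ("variant_5", "typical", 7),
    ("variant_5", "typical", 7), ("variant_5", "edge", 9),
]

table = mean_by_slice(scores)
print(winner(table, "overall"), winner(table, "edge"))
```

With these toy numbers the overall winner and the edge-slice winner differ, which is the situation that prompts the team to grow the eval set.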

Solution

Therefore:

Build a prompt-flow harness that supports variant slots. For each slot the author writes two or more variants. The harness runs all variants against the frozen eval dataset and rubric, scores them (deterministic checker, LLM-judge, or both), and surfaces per-variant scores plus per-item differences. The team picks the winner from the surfaced scores. This is distinct from Shadow Canary, which compares two versions on live traffic: variant evaluation is offline, batched, and pre-deployment.
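The harness loop described above can be sketched in a few lines. This is an illustration under assumptions, not a reference implementation: each variant is a plain Python function standing in for a prompt plus model call, the scorer is a deterministic exact-match checker (an LLM-judge could be swapped in behind the same `score` signature), and every name here (`run_variants`, `summarize`, `item_diffs`, the toy dataset) is hypothetical.

```python
from statistics import mean


def run_variants(variants, dataset, score):
    """Run every variant on every item; return {variant: {item_id: score}}."""
    results = {}
    for name, variant in variants.items():
        results[name] = {
            item["id"]: score(variant(item["input"]), item["expected"])
            for item in dataset
        }
    return results


def summarize(results):
    """Mean score per variant."""
    return {name: mean(list(per_item.values())) for name, per_item in results.items()}


def item_diffs(results, a, b):
    """Item ids on which variants a and b received different scores."""
    return sorted(i for i in results[a] if results[a][i] != results[b][i])


def exact_match(output, expected):
    """Deterministic checker: 1 if the output equals the expected value, else 0."""
    return 1 if output == expected else 0


# Stub variants: stand-ins for two wordings of a name-extraction prompt.
variants = {
    "split_comma": lambda text: text.split(",")[0].strip(),
    "first_word": lambda text: text.split()[0].strip(","),
}

# Toy frozen dataset, invented for illustration.
dataset = [
    {"id": 1, "input": "Ada, 36, London", "expected": "Ada"},
    {"id": 2, "input": "Bo Li, 41, Oslo", "expected": "Bo Li"},
    {"id": 3, "input": "Cy, 29, Rome", "expected": "Cy"},
]

results = run_variants(variants, dataset, exact_match)
summary = summarize(results)
best = max(summary, key=summary.get)
print(summary, best, item_diffs(results, "split_comma", "first_word"))
```

Keeping the per-item score table, rather than only the means, is what makes the winner inspectable: `item_diffs` points a reviewer at exactly the items where two variants disagree.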

What this pattern forbids. A prompt edit must not be selected by demo or author taste; variants are evaluated as a batch against the frozen rubric and the winner is selected by measured score.

The smaller patterns that complete this one —

  • uses: Eval Harness ★★. Run a held-out dataset against agent versions to detect regressions and measure improvement.
  • uses: Frozen Rubric Reflection. Constrain reflection to a fixed, hand-authored rubric of criteria so the reviewer cannot invent new ones each run.
  • uses: LLM-as-Judge ★★. Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.

And the patterns that stand alongside it, or against it —

  • composes-with: [evaluation-driven-development]
  • composes-with: Bayesian Bandit Experimentation. Replace fixed-split A/B tests between agent variants with a bandit that dynamically reallocates traffic toward better-performing variants based on observed reward, bounding regret from bad variants.
  • alternative-to: Shadow Canary ★★. Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.
  • complements: Prompt Versioning ★★. Treat prompts as immutable, hashed, semver'd artefacts in a registry; deploy and roll back like code.
  • composes-with: Dimensional Synthetic Eval Set. Generate evaluation inputs not by free-form LLM prompting (which mode-collapses) but by enumerating tuples over explicitly named dimensions and seeding generation from each tuple.
