Verification & Reflection

Prompt Variant Evaluation

Author multiple variants of the same prompt node, run them as a batch against a shared dataset, and let an automated evaluation flow score them so the winning variant is selected by measurement.

Problem

Without a batched comparison harness, each prompt edit is a vibe check. Authors converge on whatever looks good on the two examples they happened to test. Later reviewers cannot tell whether the chosen variant is better than the rejected ones, because the rejected ones were never measured. The team accumulates committed prompts whose superiority over the alternatives no one can verify.

Solution

Build a prompt-flow harness that supports variant slots. For each slot, the author writes two or more variants. The harness runs every variant against the same frozen eval dataset and rubric, scores the outputs (deterministic checker, LLM judge, or both), and surfaces per-variant scores plus per-item differences. The team picks the winner from the surfaced scores. This differs from [[shadow-canary]], which compares two versions on live traffic: variant evaluation is offline, batched, and pre-deployment.
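The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `EvalItem`, `evaluate_variants`, `summarize`, `per_item_differences`, and `stub_model` are all hypothetical, the scorer is a simple deterministic exact-match checker, and `stub_model` is a canned stand-in for a real LLM call so the example runs offline.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class EvalItem:
    """One row of the frozen eval dataset (hypothetical shape)."""
    item_id: str
    input: str
    expected: str


ModelFn = Callable[[str], str]            # prompt -> model output
Scorer = Callable[[str, EvalItem], float]  # (output, item) -> score


def exact_match(output: str, item: EvalItem) -> float:
    """Deterministic checker: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == item.expected.strip().lower() else 0.0


def evaluate_variants(
    variants: dict[str, str],
    dataset: list[EvalItem],
    model: ModelFn,
    scorer: Scorer,
) -> dict[str, dict[str, float]]:
    """Run every variant against the same dataset; return per-item scores."""
    return {
        name: {
            item.item_id: scorer(model(template.format(input=item.input)), item)
            for item in dataset
        }
        for name, template in variants.items()
    }


def summarize(results: dict[str, dict[str, float]]) -> dict[str, float]:
    """Mean score per variant."""
    return {name: mean(scores.values()) for name, scores in results.items()}


def per_item_differences(
    results: dict[str, dict[str, float]], a: str, b: str
) -> dict[str, float]:
    """Score deltas (a minus b) for the items where the two variants disagree."""
    return {
        item_id: results[a][item_id] - results[b][item_id]
        for item_id in results[a]
        if results[a][item_id] != results[b][item_id]
    }


def stub_model(prompt: str) -> str:
    """Assumption: canned stand-in for an LLM call, used only for this demo."""
    text = prompt.rsplit("Review:", 1)[-1].lower()
    label = "positive" if "great" in text else "negative"
    # The stub answers tersely only when the prompt asks for one word.
    return label if "one word" in prompt.lower() else f"The sentiment is {label}."


variants = {
    "terse": "Classify the sentiment in one word.\nReview: {input}",
    "verbose": "What is the sentiment of this text?\nReview: {input}",
}
dataset = [
    EvalItem("r1", "Great phone", "positive"),
    EvalItem("r2", "Battery died in a day", "negative"),
    EvalItem("r3", "Not great at all", "negative"),
]

results = evaluate_variants(variants, dataset, stub_model, exact_match)
scores = summarize(results)
winner = max(scores, key=scores.get)
diffs = per_item_differences(results, "terse", "verbose")
```

With the stub above, `terse` wins because its outputs match the expected labels on two of three items, while `verbose` produces full sentences that the exact-match checker rejects. In a real harness the `model` argument would wrap an inference client and the `scorer` could be swapped for an LLM judge with the same signature; the per-item view is what lets a reviewer see where the variants actually diverge instead of trusting the aggregate alone.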

When to use

Multiple plausible prompt variants exist and the team needs to pick among them.
An eval dataset and rubric exist (i.e. [[evaluation-driven-development]] is in place).
Inference cost permits batched comparison.
