VII · Verification & Reflection · Emerging

Dimensional Synthetic Eval Set

also known as Tuple-Seeded Eval Generation, Dimensional Mode-Collapse Avoidance

Generate evaluation inputs not by free-form LLM prompting (which mode-collapses) but by enumerating tuples over explicitly named dimensions and seeding generation from each tuple.

Context

A team needs to expand its evaluation set for an LLM application. Asking an LLM 'generate 200 evaluation prompts for this feature' produces a corpus that mode-collapses to a few archetypes the LLM finds most likely. The eval set looks varied but covers only a sliver of the actual input space.

Problem

Free-form synthetic eval generation has a known failure mode: the generating LLM converges on its own high-likelihood prompt shapes, so the resulting set stays monotonous no matter how many items are generated. Coverage of the genuine input space (different personas, scenarios, complexity levels, modalities) is poor, and the surface variety of the prompts hides this from the team.

Forces

  • Free-form generation mode-collapses; sampling more does not fix it.
  • Coverage of named dimensions is the actual property the eval set needs.
  • Naming dimensions explicitly is itself useful documentation.
  • The number of tuples grows as the product of the dimension cardinalities, so large grids need sampling.
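
The last force is simple arithmetic, and a rough sketch makes the growth concrete. The dimension names and cardinalities below are hypothetical, chosen only for illustration, as is the figure of 10 items per tuple.

```python
import math

# Hypothetical cardinalities (number of values per named dimension).
cardinalities = {"persona": 3, "feature": 5, "scenario": 3, "modality": 3}


def grid_size(cards):
    """Number of tuples in the full cross-product of the dimensions."""
    return math.prod(cards.values())


items_per_tuple = 10  # assumed generation budget per tuple

full = grid_size(cardinalities)
print(full, full * items_per_tuple)  # 135 1350

# Adding one more four-valued dimension multiplies the grid by four.
wider = {**cardinalities, "complexity": 4}
print(grid_size(wider))  # 540
```

Each added dimension multiplies the grid rather than adding to it, which is why a full enumeration stops being practical after a handful of dimensions.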

Example

A team building a customer-support agent names three dimensions: persona (new / returning / staff), scenario (success / blocked / ambiguous), product-area (billing / shipping / returns). The 3×3×3 = 27 tuple grid drives generation; each tuple produces 10 eval inputs. The 270-item eval set has visible coverage per cell. A subsequent review notices that the (staff × ambiguous × returns) cell is the weakest; the team adds focused items there.
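
The grid in this example can be sketched in a few lines. The dimension values come from the example; the function names are hypothetical, and `placeholder_items` is a stand-in for the real LLM call, which is not shown here.

```python
from collections import Counter
from itertools import product

# Dimensions and values taken from the example above.
DIMENSIONS = {
    "persona": ["new", "returning", "staff"],
    "scenario": ["success", "blocked", "ambiguous"],
    "product_area": ["billing", "shipping", "returns"],
}
ITEMS_PER_CELL = 10


def build_grid(dimensions):
    """Return every cell of the grid as a tuple of values, in dimension order."""
    return list(product(*dimensions.values()))


def placeholder_items(cell, n):
    """Stand-in for the LLM call: returns n placeholder items tagged with their cell."""
    label = " x ".join(cell)
    return [{"cell": cell, "input": f"<item {i} for {label}>"} for i in range(n)]


def coverage(grid, items):
    """Count items per cell; cells with no items are reported as 0."""
    counts = Counter(item["cell"] for item in items)
    return {cell: counts[cell] for cell in grid}


grid = build_grid(DIMENSIONS)
eval_set = [item for cell in grid for item in placeholder_items(cell, ITEMS_PER_CELL)]
per_cell = coverage(grid, eval_set)

print(len(grid), len(eval_set))                     # 27 270
print(per_cell[("staff", "ambiguous", "returns")])  # 10
```

Because every item carries its cell, the same `coverage` table can later be joined with per-item scores, which is one way a review could surface a weak cell.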

Solution

Therefore:

List the named dimensions of the input space, for example persona (new user / power user / staff), feature (the feature variants the agent will face), scenario (success / failure / ambiguous), and modality (text / voice / image). Generate the cross-product of tuples, and sample from it if it is too large. For each tuple, ask the LLM to generate eval inputs grounded in that tuple's specifics. The resulting set covers the dimensions by construction, and coverage gaps are visible: the tuple grid shows which combinations are empty.
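
A minimal sketch of these steps, under stated assumptions: the `feature` values are hypothetical placeholders, the function names are illustrative, uniform random sampling is just one possible sampling choice, and the prompt wording is an example rather than a recommended template. The call to the generating LLM is left out.

```python
import random
from itertools import product

DIMENSIONS = {
    "persona": ["new user", "power user", "staff"],
    "feature": ["feature-a", "feature-b"],  # hypothetical feature variants
    "scenario": ["success", "failure", "ambiguous"],
    "modality": ["text", "voice", "image"],
}


def sample_tuples(dimensions, max_tuples, seed=0):
    """Enumerate the cross-product; sample without replacement if it is too large."""
    names = list(dimensions)
    all_tuples = [dict(zip(names, combo)) for combo in product(*dimensions.values())]
    if len(all_tuples) <= max_tuples:
        return all_tuples
    rng = random.Random(seed)  # fixed seed keeps the sampled grid reproducible
    return rng.sample(all_tuples, max_tuples)


def seed_prompt(tup, n_items):
    """Build a generation prompt grounded in one tuple's specifics."""
    spec = "\n".join(f"- {name}: {value}" for name, value in tup.items())
    return (
        f"Write {n_items} distinct evaluation inputs for the agent.\n"
        f"Every input must match all of these attributes:\n{spec}"
    )


tuples = sample_tuples(DIMENSIONS, max_tuples=20)
prompts = [seed_prompt(t, n_items=5) for t in tuples]
print(len(tuples))
print(prompts[0])
```

Uniform sampling does not guarantee that every value of every dimension appears in the sample, so a team that samples may want to check per-value counts afterwards or use a stratified scheme instead.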

What this pattern forbids. Synthetic eval inputs must not be generated by free-form LLM prompting alone; generation is seeded from tuples over explicitly named dimensions to bound mode-collapse.

The smaller patterns that complete this one —

  • uses Eval Harness ★★: Run a held-out dataset against agent versions to detect regressions and measure improvement.

And the patterns that stand alongside it, or against it —

  • composes-with [evaluation-driven-development]
  • composes-with Prompt Variant Evaluation ★★: Author multiple variants of the same prompt node, run them as a batch against a shared dataset, and let an automated evaluation flow score them so the winning variant is selected by measurement.
  • complements Frozen Rubric Reflection: Constrain reflection to a fixed, hand-authored rubric of criteria so the reviewer cannot invent new ones each run.
  • complements LLM-as-Judge ★★: Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
