VII · Verification & Reflection · Emerging ★

Dimensional Synthetic Eval Set

also known as Tuple-Seeded Eval Generation, Dimensional Mode-Collapse Avoidance

Generate evaluation inputs not by free-form LLM prompting (which mode-collapses) but by enumerating tuples over explicitly named dimensions and seeding generation from each tuple.

Context

A team needs to expand its evaluation set for an LLM application. Asking an LLM 'generate 200 evaluation prompts for this feature' produces a corpus that mode-collapses to a few archetypes the LLM finds most likely. The eval set looks varied but covers only a sliver of the actual input space.

Problem

Free-form synthetic eval generation has a known failure mode: the generating LLM converges on its high-likelihood prompt shapes, and the resulting set is monotonous regardless of how many items are generated. Coverage of the genuine input space (different personas, different scenarios, different complexity levels, different modalities) is poor, and the surface variety of the prompts hides this from the team.

Forces

Free-form generation mode-collapses; sampling more does not fix it.
Coverage of named dimensions is the actual property the eval set needs.
Naming dimensions explicitly is itself useful documentation.
Tuple enumeration scales as the product of dimension cardinalities, so large grids must be sampled rather than enumerated in full.
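The last force can be made concrete with a small sketch. It computes the grid size as the product of the dimension cardinalities and draws a uniform sample of distinct tuples without building the full grid. The dimension values (in particular the `feature` variants) and the function names `grid_size` and `sample_tuples` are illustrative assumptions, not part of the pattern.

```python
import math
import random

# Hypothetical dimensions; the "feature" values are placeholders.
DIMENSIONS = {
    "persona": ["new user", "power user", "staff"],
    "feature": ["search", "checkout", "refund", "export"],
    "scenario": ["success", "failure", "ambiguous"],
    "modality": ["text", "voice", "image"],
}


def grid_size(dimensions):
    """Number of tuples in the full grid: the product of cardinalities."""
    return math.prod(len(values) for values in dimensions.values())


def sample_tuples(dimensions, k, seed=0):
    """Draw up to k distinct tuples uniformly, without materializing the grid."""
    names = list(dimensions)
    total = grid_size(dimensions)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sampled = []
    for index in rng.sample(range(total), min(k, total)):
        combo = {}
        # Decode the flat index into one position per dimension (mixed radix).
        for name in names:
            index, position = divmod(index, len(dimensions[name]))
            combo[name] = dimensions[name][position]
        sampled.append(combo)
    return sampled


sampled = sample_tuples(DIMENSIONS, k=20)
```

Uniform sampling is only one option. A team that cares about specific pairs of values might prefer a stratified or pairwise scheme, which this sketch does not attempt.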

Example

A team building a customer-support agent names three dimensions: persona (new / returning / staff), scenario (success / blocked / ambiguous), product-area (billing / shipping / returns). The 3×3×3 = 27 tuple grid drives generation; each tuple produces 10 eval inputs. The 270-item eval set has visible coverage per cell. A subsequent review notices that the (staff × ambiguous × returns) cell is the weakest; the team adds focused items there.
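The arithmetic in this example can be sketched as follows. The three dimensions and the figure of 10 items per cell come from the example; `generate_inputs` is a hypothetical stand-in for the LLM call and only returns placeholder strings.

```python
from collections import Counter
from itertools import product

PERSONAS = ["new", "returning", "staff"]
SCENARIOS = ["success", "blocked", "ambiguous"]
PRODUCT_AREAS = ["billing", "shipping", "returns"]
ITEMS_PER_CELL = 10


def generate_inputs(cell, n):
    """Hypothetical stand-in for an LLM call seeded with one tuple."""
    persona, scenario, area = cell
    return [
        f"[{persona} / {scenario} / {area}] placeholder input #{i + 1}"
        for i in range(n)
    ]


# Full cross-product of the three dimensions: 3 x 3 x 3 cells.
grid = list(product(PERSONAS, SCENARIOS, PRODUCT_AREAS))

# Every eval item records the cell it was seeded from.
eval_set = [
    {"cell": cell, "input": text}
    for cell in grid
    for text in generate_inputs(cell, ITEMS_PER_CELL)
]

# Coverage map: number of items per cell.
coverage = Counter(item["cell"] for item in eval_set)

print(len(grid), len(eval_set))  # 27 270
```

Because each item keeps its originating cell, adding focused items to a weak cell such as `("staff", "ambiguous", "returns")` is a matter of calling the generator again for that one tuple.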

Diagram

flowchart LR
    D1[Persona dims] --> T[Tuple grid]
    D2[Scenario dims] --> T
    D3[Modality dims] --> T
    T --> Seed[Seed: one tuple at a time]
    Seed --> Gen[Generate eval input]
    Gen --> Set[Eval set]
    Set --> Cov[Coverage map per cell]

Solution

Therefore:

List the named dimensions of the input space: persona (new user / power user / staff), feature (the feature variants the agent will face), scenario (success / failure / ambiguous), modality (text / voice / image). Generate the cross-product of tuples; sample if it's too large. For each tuple, ask the LLM to generate eval inputs grounded in that tuple's specifics. The resulting set covers the dimensions by construction. Coverage gaps are visible — the tuple grid shows which combinations are empty.
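Two steps of the solution lend themselves to a short sketch: turning one tuple into a seed prompt, and reading coverage gaps off the grid. The prompt wording, the item layout (`{"cell": ..., "input": ...}`), and the names `seed_prompt` and `empty_cells` are assumptions made for illustration. The `feature` dimension is omitted here because its values are application-specific.

```python
from collections import Counter
from itertools import product

DIMENSIONS = {
    "persona": ["new user", "power user", "staff"],
    "scenario": ["success", "failure", "ambiguous"],
    "modality": ["text", "voice", "image"],
}


def seed_prompt(cell):
    """Build a generation prompt grounded in one tuple's specifics."""
    spec = "\n".join(f"- {name}: {value}" for name, value in cell.items())
    return (
        "Write one realistic user input for the application under test.\n"
        "It must match every attribute below:\n" + spec
    )


def empty_cells(dimensions, eval_items):
    """Return every combination in the grid that has no eval item yet."""
    names = list(dimensions)
    filled = Counter(
        tuple(item["cell"][name] for name in names) for item in eval_items
    )
    return [
        combo for combo in product(*dimensions.values()) if filled[combo] == 0
    ]


# Usage: an eval set that is missing exactly one cell.
all_cells = [dict(zip(DIMENSIONS, combo)) for combo in product(*DIMENSIONS.values())]
items = [{"cell": cell, "input": seed_prompt(cell)} for cell in all_cells[:-1]]
gaps = empty_cells(DIMENSIONS, items)
```

Here `gaps` holds the single uncovered combination. In a real pipeline the `input` field would hold the LLM's generated text, not the prompt itself.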

What this pattern forbids. Synthetic eval inputs must not be generated by free-form LLM prompting alone; generation is seeded from tuples over explicitly named dimensions to bound mode-collapse.

The smaller patterns that complete this one —

uses Eval Harness ★★: Run a held-out dataset against agent versions to detect regressions and measure improvement.

And the patterns that stand alongside it, or against it —

composes-with [evaluation-driven-development]
composes-with Prompt Variant Evaluation ★★: Author multiple variants of the same prompt node, run them as a batch against a shared dataset, and let an automated evaluation flow score them so the winning variant is selected by measurement.
complements Frozen Rubric Reflection ★: Constrain reflection to a fixed, hand-authored rubric of criteria so the reviewer cannot invent new ones each run.
complements LLM-as-Judge ★★: Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.

Used in recipes

Production LLM Platform
hardening
Mode-collapse-resistant offline eval coverage.

Used in frameworks

Ragas (synthetic testset generation)
core · 3 patterns · Orchestration Frameworks ★★ mature
Ragas rejects naive LLM prompting (which mode-collapses) and instead enumerates named question evolution types (simple, reasoning, multi_context, conditional) with a configurable…

Provenance

Source: patterns/dimensional-synthetic-eval-set.md on GitHub · commit 135ae3c
Added to catalog: 2026-05-23
Last updated: 2026-05-23
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.