Verification & Reflection

Dimensional Synthetic Eval Set

Generate evaluation inputs not by free-form LLM prompting (which mode-collapses) but by enumerating tuples over explicitly named dimensions and seeding generation from each tuple.

Problem

Free-form synthetic eval generation has a known failure mode: the generating LLM converges on its highest-likelihood prompt shapes, so the resulting set is monotonous no matter how many items are generated. Coverage of the real input space (different personas, scenarios, complexity levels, and modalities) is poor, and the surface variety of the prompts hides this from the team.

Solution

List the named dimensions of the input space, for example persona (new user / power user / staff), feature (the feature variants the agent will face), scenario (success / failure / ambiguous), and modality (text / voice / image). Generate the cross-product of the dimension values as tuples; sample from it if it is too large. For each tuple, ask the LLM to generate eval inputs grounded in that tuple's specifics. The resulting set covers the dimensions by construction, and coverage gaps are visible: the tuple grid shows which combinations are empty.
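
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the feature values, the function names (`enumerate_tuples`, `build_prompt`, `coverage_gaps`), and the prompt wording are all hypothetical, and the call to the generating LLM is left out because it depends on the team's own client.

```python
import itertools
import random
from collections import Counter

# Persona, scenario and modality values come from the pattern description.
# The feature values are placeholders; substitute the agent's real feature variants.
DIMENSIONS = {
    "persona": ["new user", "power user", "staff"],
    "feature": ["search", "checkout"],  # hypothetical
    "scenario": ["success", "failure", "ambiguous"],
    "modality": ["text", "voice", "image"],
}


def enumerate_tuples(dimensions, max_tuples=None, seed=0):
    """Return the cross-product as a list of dicts, sampled if it is too large."""
    names = list(dimensions)
    grid = [
        dict(zip(names, values))
        for values in itertools.product(*dimensions.values())
    ]
    if max_tuples is not None and len(grid) > max_tuples:
        # Uniform sampling without replacement; a fixed seed keeps the set reproducible.
        grid = random.Random(seed).sample(grid, max_tuples)
    return grid


def build_prompt(tup, n_items=3):
    """Build a generation prompt seeded by one tuple (the wording is an assumption)."""
    spec = "\n".join(f"- {name}: {value}" for name, value in tup.items())
    return (
        f"Write {n_items} realistic evaluation inputs for the agent.\n"
        f"Every input must fit this combination:\n{spec}"
    )


def coverage_gaps(dimensions, items):
    """List the combinations in the full grid that have no eval item yet.

    `items` is a list of dicts, each carrying the tuple it was generated from.
    """
    names = list(dimensions)
    counts = Counter(tuple(item[name] for name in names) for item in items)
    return [
        combo
        for combo in itertools.product(*dimensions.values())
        if counts[combo] == 0
    ]
```

Each prompt returned by `build_prompt` would be sent to the generating LLM, and every generated item would be stored together with its source tuple so that `coverage_gaps` can report the empty cells. Note that uniform sampling does not guarantee that every value of every dimension survives the cut; a team that needs that guarantee would have to stratify the sample instead.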

When to use

Eval set is being expanded and coverage matters.
Input space has natural dimensions the team can name.
Mode-collapse in free-form generation has been observed or is suspected.
