Rubric and Grounding Profile Evaluation

Pick the best agent profile from a set of candidates by scoring each one on a frozen quality rubric and a source-grounding check, run in batch.

Description

Compare candidate agent profiles (the personas an agent can adopt) in a fair head-to-head. Run every profile over the same test set in one batch. Grade output quality with a model-judged rubric and, separately, check grounding: whether the answers stay tied to the source text. Both scores are computed in a batch harness in the style of Azure Prompt Flow, then rolled up per profile, so the winning prompt or persona is chosen on numbers, not vibes. This differs from plain model-as-judge grading in two ways. First, the unit of comparison is the whole profile: the system prompt, persona, and instructions. Second, grounding to the source is a full second axis alongside the quality rubric.
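As a rough illustration of those two points, the sketch below models a profile as one unit and keeps the two scores as separate fields. The names `Profile`, `CaseScore`, and `grounding_score` are hypothetical, and the grounding check is a deliberately crude word-overlap heuristic standing in for whatever checker a real harness would use.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """The unit of comparison: the whole profile, not a single prompt string."""
    name: str
    system_prompt: str
    persona: str
    instructions: str


@dataclass(frozen=True)
class CaseScore:
    """Two independent axes for one answer; they are not merged here."""
    rubric: float     # quality score from the model judge (scale is an assumption)
    grounding: float  # share of answer sentences supported by the source, 0.0 to 1.0


def _words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(answer: str, source: str, threshold: float = 0.8) -> float:
    """Toy grounding check (an assumption, not a production method).

    A sentence counts as supported when at least `threshold` of its
    distinct words also appear somewhere in the source text.
    """
    source_words = _words(source)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = _words(sentence)
        if words and len(words & source_words) / len(words) >= threshold:
            supported += 1
    return supported / len(sentences)
```

An answer that copies one sentence from the source and then adds an unsupported claim scores 0.5 on this check, whatever the quality rubric says about it. That separation is what makes grounding a second axis.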

When to apply

Use this when a team has several candidate system prompts, personas, or profile templates for the same agent role and must pick one. Examples: comparing tone variants for a support agent, or comparing instruction styles for an assistant grounded in policy docs. Don't apply it when there is only one candidate profile. Skip it too when the task has no source text, as in pure creative writing, because grounding then has nothing to check.

What it involves

  • Author the rubric and grounding checks
  • Enumerate candidate profiles
  • Batch-run the eval set per profile
  • Score rubric and grounding independently
  • Aggregate per profile and pick the winner
  • Spot-check the leader with humans
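The steps above can be sketched as a small batch loop. This is a minimal sketch under stated assumptions, not the harness itself: `generate`, `judge_rubric`, and `check_grounding` are hypothetical callables you would supply (for example, wrappers around a model call and a judge prompt), the rubric criteria are placeholders, and the selection rule (a grounding floor, then highest rubric mean) is one possible choice that the pattern does not prescribe.

```python
from statistics import mean
from typing import Callable

# Frozen rubric: a tuple so it cannot be edited between profile runs.
# The criteria names are placeholders.
RUBRIC: tuple[str, ...] = ("accuracy", "tone", "completeness")


def evaluate_profiles(
    profiles: dict[str, str],                        # profile name -> system prompt
    eval_set: list[dict],                            # each case: {"question", "source"}
    generate: Callable[[str, str, str], str],        # hypothetical model call
    judge_rubric: Callable[[str, dict, tuple], float],  # hypothetical model judge
    check_grounding: Callable[[str, str], float],    # hypothetical grounding check
) -> dict[str, dict[str, float]]:
    """Run every profile over the same eval set and score both axes separately."""
    results = {}
    for name, system_prompt in profiles.items():
        rubric_scores, grounding_scores = [], []
        for case in eval_set:
            answer = generate(system_prompt, case["question"], case["source"])
            rubric_scores.append(judge_rubric(answer, case, RUBRIC))
            grounding_scores.append(check_grounding(answer, case["source"]))
        # Aggregate per profile; the two axes stay in separate columns.
        results[name] = {
            "rubric": mean(rubric_scores),
            "grounding": mean(grounding_scores),
        }
    return results


def pick_winner(results: dict[str, dict[str, float]], min_grounding: float = 0.8) -> str:
    """Assumed rule: drop profiles under the grounding floor, then take the
    highest rubric mean (grounding breaks ties). If no profile clears the
    floor, fall back to comparing all of them."""
    eligible = {n: r for n, r in results.items() if r["grounding"] >= min_grounding}
    pool = eligible or results
    return max(pool, key=lambda n: (pool[n]["rubric"], pool[n]["grounding"]))
```

The name returned by `pick_winner` is the leader to hand to human reviewers for the final spot-check; the code does not replace that step.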
