Methodology · Evaluation · emerging · verified

Rubric and Grounding Profile Evaluation

Also known as: profile rubric eval, profile-grading harness

Applies to: agent, llm-app, rag-system

Tags: profile-selection, rubric, grounding, prompt-flow

Compare candidate agent profiles (the personas an agent can adopt) in a fair head-to-head. Run each profile over the same test set in one batch, grade output quality with a model-judged rubric, and separately check grounding: whether the answers stay tied to the source text. Both scores are computed in a batch harness, in the style of Azure Prompt Flow, then rolled up per profile, so the winning prompt or persona is chosen on numbers, not vibes.

This differs from plain model-as-judge grading in two ways. First, the unit of comparison is the whole profile: the system prompt, persona, and instructions together. Second, grounding to the source is a full second axis alongside the quality rubric.

Methodology process overview

Intent. Pick the best agent profile from a set of candidates by scoring each one on a frozen quality rubric and a source-grounding check, run in batch.

When to apply. Use this when a team has several candidate system prompts, personas, or profile templates for the same agent role and must pick one. Examples: comparing tone variants for a support agent, or comparing instruction styles for an assistant grounded in policy docs. Skip it when there is only one candidate profile, or when the task has no source text for grounding to check against, as in pure creative writing.

Inputs

  • Candidate profiles: Two or more system-prompt or persona definitions to compare.
  • Eval set: The inputs the agent must respond to. Ideally each comes with the source documents the answer should be based on.
  • Rubric prompt: Fixed instructions for the judge. They cover the quality points to score on each output.
  • Grounding checker: A prompt or routine that flags any part of an answer the source text does not back up.
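
As a rough illustration, the inputs above could be represented as plain Python data. This is a minimal sketch, not a prescribed schema: the class names `Profile` and `EvalCase`, the `fingerprint` helper, the example profiles, and the 1 to 5 scale in the rubric text are all assumptions made for the example.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)  # frozen: a profile cannot be mutated once created
class Profile:
    name: str
    version: str
    system_prompt: str

    def fingerprint(self) -> str:
        # Short hash of the prompt text; any edit to the prompt changes it.
        return hashlib.sha256(self.system_prompt.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    user_input: str
    source_text: str  # the document(s) the answer should be based on


# Hypothetical rubric prompt; the criteria come from the methodology,
# the 1-5 scale is an assumption.
RUBRIC_PROMPT = (
    "Score the answer from 1 to 5 on each criterion: "
    "helpfulness, tone, brevity, format adherence."
)

# Hypothetical candidate profiles and a one-item eval set.
profiles = [
    Profile("formal", "v1", "Answer strictly from the provided policy text."),
    Profile("friendly", "v1", "Be warm and upbeat while answering from the policy text."),
]
eval_set = [
    EvalCase("c1", "What is the refund window?", "Refunds are accepted within 30 days."),
]
```

Recording the fingerprint next to each result row is one way to show later that no profile was edited mid-test.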

Outputs

  • Per-profile rubric scores: The rolled-up quality scores for each candidate profile.
  • Per-profile grounding scores: The share of an answer's claims that can be traced back to the source text, per profile.
  • Profile comparison report: A side-by-side leaderboard of quality versus grounding for each candidate, with the winner picked.
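
One way to produce the two per-profile scores and a side-by-side table is sketched below. The row layout (`rubric`, `claims_supported`, `claims_total`), the sample numbers, and the choice to average per-answer grounding shares are assumptions for illustration; an answer with no claims is treated as fully grounded, which is also an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scored rows, one per (profile, test input).
rows = [
    {"profile": "formal", "rubric": 4.0, "claims_supported": 5, "claims_total": 5},
    {"profile": "formal", "rubric": 3.0, "claims_supported": 3, "claims_total": 4},
    {"profile": "friendly", "rubric": 5.0, "claims_supported": 2, "claims_total": 4},
    {"profile": "friendly", "rubric": 4.0, "claims_supported": 4, "claims_total": 4},
]


def grounding_share(row):
    # Share of this answer's claims that trace back to the source.
    if row["claims_total"] == 0:
        return 1.0  # assumption: nothing claimed means nothing unsupported
    return row["claims_supported"] / row["claims_total"]


def summarize(rows):
    by_profile = defaultdict(list)
    for row in rows:
        by_profile[row["profile"]].append(row)
    # Two separate numbers per profile; they are never merged into one.
    return {
        name: {
            "rubric_mean": mean([r["rubric"] for r in rs]),
            "grounding_mean": mean([grounding_share(r) for r in rs]),
        }
        for name, rs in by_profile.items()
    }


def render_report(summary):
    lines = [f"{'profile':<10} {'rubric':>7} {'grounding':>10}"]
    for name in sorted(summary):
        s = summary[name]
        lines.append(f"{name:<10} {s['rubric_mean']:>7.2f} {s['grounding_mean']:>10.2f}")
    return "\n".join(lines)


summary = summarize(rows)
print(render_report(summary))
```

With these sample rows the "friendly" profile leads on the rubric while "formal" leads on grounding, which is the kind of split the report is meant to expose.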

Steps (6)

  1. Author the rubric and grounding checks

    Write a frozen model-judge rubric covering the quality criteria: helpfulness, tone, brevity, and format adherence. Separately, write a grounding checker that takes an answer and its source, then marks each claim as supported or unsupported.

  2. Enumerate candidate profiles

    List the system prompts and persona variants to compare, each pinned to a version. Treat each profile as one sealed unit. Do not edit any of them mid-test.

    uses: Agent Persona Profile

  3. Batch-run the eval set per profile

    For each profile, run the agent over every test input in a batch harness, such as Azure Prompt Flow or similar. Save every row: the profile, the input, the output, and the source.

  4. Score rubric and grounding independently

    Run the rubric judge on every output. Run the grounding checker on those same outputs against their source text. Save both scores on each row.

    uses: LLM-as-Judge

  5. Aggregate per profile and pick the winner

    Work out the average score and pass rate per profile on both axes. Reject any profile that wins on quality but loses on grounding. A smooth-sounding but unfaithful profile is a failure, not a winner.

  6. Spot-check the leader with humans

    Have an expert read a sample of the winning profile's outputs. Confirm the quality and grounding scores match what a person would say before you promote the profile.
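
The batch-run and scoring steps (3 and 4) can be sketched in a few functions. This is a toy under stated assumptions, not a Prompt Flow integration: `fake_agent` and `fake_judge` are hypothetical stand-ins for real model calls, and the grounding checker is a naive substring match that treats each sentence as one claim. A real checker would more likely be a model prompt or an entailment routine.

```python
import re
from typing import Callable


def split_claims(answer: str) -> list[str]:
    # Naive stand-in: treat each sentence as one claim.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def check_grounding(answer: str, source: str) -> dict:
    # Toy rule: a claim is supported if it appears verbatim in the source.
    source_lc = source.lower()
    claims = split_claims(answer)
    supported = sum(1 for c in claims if c.lower().rstrip(".!?") in source_lc)
    return {"claims_total": len(claims), "claims_supported": supported}


def run_batch(profiles: dict[str, str], cases: list[dict],
              agent: Callable[[str, str], str]) -> list[dict]:
    # Step 3: every profile sees every test input; every row is saved.
    rows = []
    for profile_id, system_prompt in profiles.items():
        for case in cases:
            rows.append({
                "profile": profile_id,
                "input": case["input"],
                "source": case["source"],
                "output": agent(system_prompt, case["input"]),
            })
    return rows


def score_rows(rows: list[dict], judge: Callable[[str, str], float]) -> list[dict]:
    # Step 4: rubric and grounding are scored separately, both stored per row.
    for row in rows:
        row["rubric"] = judge(row["input"], row["output"])
        row.update(check_grounding(row["output"], row["source"]))
    return rows


# Hypothetical stand-ins for model calls.
def fake_agent(system_prompt: str, user_input: str) -> str:
    if "strictly" in system_prompt:
        return "Refunds are accepted within 30 days."
    return "Refunds are accepted within 30 days. We also offer free returns forever!"


def fake_judge(user_input: str, output: str) -> float:
    return 5.0 if "!" in output else 4.0  # mimics a judge that rewards enthusiasm


profiles = {
    "formal@v1": "Answer strictly from the policy.",
    "friendly@v1": "Be warm and upbeat.",
}
cases = [{
    "input": "What is the refund window?",
    "source": "Refunds are accepted within 30 days. Shipping fees are not refunded.",
}]
rows = score_rows(run_batch(profiles, cases, fake_agent), fake_judge)
```

In this toy run the friendly profile earns the higher rubric score but only one of its two claims is supported, which is exactly the case the grounding axis exists to catch.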

Principles

  • Quality and grounding are separate axes. Never average them into one number.
  • The whole profile is what you compare, not single prompt tweaks inside a profile.
  • A grounding failure outweighs a quality win. A smooth but unfaithful answer is worse than a clumsy but faithful one.
  • Picking a profile is a frozen-test event. Once chosen, the profile is locked until the next re-test.
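
One possible reading of these principles in code is a grounding gate followed by a quality ranking, with the two scores never blended. The function name `pick_winner`, the summary layout, the sample scores, and the `min_grounding` threshold of 0.9 are all assumptions for illustration; the methodology itself does not fix a threshold.

```python
from typing import Optional


def pick_winner(summary: dict[str, dict[str, float]],
                min_grounding: float = 0.9) -> Optional[str]:
    # Gate first: a profile below the grounding bar is out, whatever its quality.
    eligible = {name: s for name, s in summary.items()
                if s["grounding"] >= min_grounding}
    if not eligible:
        return None  # no candidate is faithful enough to promote
    # Rank the survivors on quality alone; the axes are never averaged.
    return max(eligible, key=lambda name: eligible[name]["quality"])


# Hypothetical per-profile scores.
summary = {
    "formal": {"quality": 3.5, "grounding": 0.95},
    "friendly": {"quality": 4.5, "grounding": 0.70},
}
winner = pick_winner(summary)
print(winner)
```

Here the smoother "friendly" profile is rejected despite its higher quality score, and returning `None` when nothing clears the bar keeps the harness from promoting an unfaithful profile by default.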

Known failure modes (2)

Related patterns (4)

Related compositions (1)

Related methodologies (2)

Sources (2)

Provenance

  • Added to catalog:
  • Last updated:
  • Verification status: verified