Rubric and Grounding Profile Evaluation
Also known as: profile rubric eval, profile-grading harness
Compare candidate agent profiles, the personas an agent can adopt, in a fair head-to-head. Run each profile over the same test set in one batch. Grade output quality with a model-judged rubric. Separately check grounding, that is, whether the answers stay tied to the source text. Both scores are computed in a batch harness, in the style of Azure Prompt Flow, then rolled up per profile. The winning prompt or persona is chosen on numbers, not vibes. This differs from plain model-as-judge grading in two ways. The thing you compare is the whole profile, meaning the system prompt, persona, and instructions. And grounding to the source is a full second axis next to the quality rubric.
Methodology process overview
Intent. Pick the best agent profile from a set of candidates by scoring each one on a frozen quality rubric and a source-grounding check, run in batch.
When to apply. Use this when a team has several candidate system prompts, personas, or profile templates for the same agent role and must pick one. Examples: comparing tone variants for a support agent, or comparing instruction styles for an assistant grounded in policy docs. Don't apply it when there is only one candidate profile. Skip it too when the task has no source text, so grounding has nothing to check, as in pure creative writing.
Inputs
- Candidate profiles — Two or more system-prompt or persona definitions to compare.
- Eval set — The inputs the agent must respond to. Ideally each comes with the source documents the answer should be based on.
- Rubric prompt — Fixed instructions for the judge. They cover the quality points to score on each output.
- Grounding checker — A prompt or routine that flags any part of an answer the source text does not back up.
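The four inputs above can be written down as plain data types. This is a minimal sketch, not a schema from the source: the class and field names (`Profile`, `EvalCase`, `EvalInputs`, `cases_missing_source`) are hypothetical, and the grounding checker is assumed to return one supported/unsupported flag per claim.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Profile:
    name: str
    version: str          # pinned version label (assumed field)
    system_prompt: str


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    user_input: str
    source_text: str      # may be empty if no source document is attached


@dataclass(frozen=True)
class EvalInputs:
    profiles: tuple[Profile, ...]
    eval_set: tuple[EvalCase, ...]
    rubric_prompt: str
    # Assumed signature: (answer, source) -> one bool per claim.
    grounding_checker: Callable[[str, str], list[bool]]

    def validate(self) -> None:
        # The methodology only applies with two or more candidates.
        if len(self.profiles) < 2:
            raise ValueError("need at least two candidate profiles")

    def cases_missing_source(self) -> list[str]:
        # Cases without source text give grounding nothing to check.
        return [c.case_id for c in self.eval_set if not c.source_text.strip()]
```

Listing the cases that lack source text, rather than rejecting them outright, reflects that sources are described as ideal rather than mandatory.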
Outputs
- Per-profile rubric scores — The rolled-up quality scores for each candidate profile.
- Per-profile grounding scores — The share of an answer's claims that can be traced back to the source text, per profile.
- Profile comparison report — A side-by-side leaderboard of quality versus grounding for each candidate, with the winner picked.
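As an illustration of the comparison report, the sketch below prints quality and grounding side by side and marks the chosen profile. The row keys (`profile`, `quality`, `grounding`) and the function name `render_report` are assumptions for this example; the source does not prescribe a report format.

```python
def render_report(rows: list[dict], winner: str) -> str:
    """Render a plain-text leaderboard, keeping the two axes in separate columns."""
    lines = [f"{'profile':<12}{'quality':>9}{'grounding':>11}"]
    # Sort by quality for display only; the winner is decided elsewhere.
    for r in sorted(rows, key=lambda r: r["quality"], reverse=True):
        mark = "  <- winner" if r["profile"] == winner else ""
        lines.append(
            f"{r['profile']:<12}{r['quality']:>9.2f}{r['grounding']:>11.2f}{mark}"
        )
    return "\n".join(lines)
```

The two scores are never merged into one column, which keeps a profile that is high on quality but low on grounding visible as such.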
Steps (6)
Author the rubric and grounding checks
Write a frozen model-judge rubric for the quality points: helpfulness, tone, brevity, and following the format. Separately, write a grounding checker. It takes an answer and its source, then marks each claim as supported or not.
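A rough sketch of the two artifacts from this step follows. The rubric is a fixed prompt string built from the four quality points. The grounding checker here is a deliberately naive lexical stand-in (a claim counts as supported only if all of its words appear in the source); in practice this would more likely be a model-based check. All names, the 1 to 5 scale, and the sentence-splitting rule are assumptions for illustration.

```python
import re

# The four quality points named in this step.
RUBRIC_CRITERIA = ("helpfulness", "tone", "brevity", "format_adherence")

# Frozen rubric prompt; the 1-5 scale is an assumed choice.
RUBRIC_PROMPT = (
    "Score the answer from 1 to 5 on each criterion: "
    + ", ".join(RUBRIC_CRITERIA)
    + ". Return one integer per criterion."
)


def split_claims(answer: str) -> list[str]:
    # Treat each sentence as one claim (a simplification).
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def check_grounding(answer: str, source: str) -> list[tuple[str, bool]]:
    # Toy rule: a claim is "supported" if every word in it occurs in the source.
    source_words = _words(source)
    return [(claim, _words(claim) <= source_words) for claim in split_claims(answer)]
```

For example, against the source "Refunds take five days to process.", the answer "Refunds take five days. Shipping is free." yields one supported claim and one unsupported claim.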
Enumerate candidate profiles
List the system prompts and persona variants to compare, each pinned to a version. Treat each profile as one sealed unit. Do not edit any of them mid-test.
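One way to enforce "do not edit mid-test" is to fingerprint each profile's text when the test starts and re-check it before scoring. This is a sketch under that assumption; the helper names and the 12-character truncated SHA-256 digest are illustrative choices, not part of the source methodology.

```python
import hashlib


def profile_fingerprint(system_prompt: str) -> str:
    # Short content hash that changes whenever the prompt text changes.
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]


def freeze_profiles(profiles: dict[str, str]) -> dict[str, str]:
    # Map each profile name to the fingerprint of its current text.
    return {name: profile_fingerprint(text) for name, text in profiles.items()}


def assert_unchanged(profiles: dict[str, str], frozen: dict[str, str]) -> None:
    # Raise if any profile was edited (or added) after freezing.
    changed = [
        name
        for name, text in profiles.items()
        if profile_fingerprint(text) != frozen.get(name)
    ]
    if changed:
        raise RuntimeError(f"profiles edited mid-test: {sorted(changed)}")
```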
Batch-run the eval set per profile
For each profile, run the agent over every test input in a batch harness, such as Azure Prompt Flow or similar. Save every row: the profile, the input, the output, and the source.
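The batch run reduces to a nested loop over profiles and test inputs. The sketch below is framework-neutral and does not use any Azure Prompt Flow API; `call_agent` is a hypothetical callable standing in for whatever invokes the agent, and the dictionary keys are assumed names.

```python
from typing import Callable


def run_batch(
    profiles: dict[str, str],
    eval_set: list[dict],
    call_agent: Callable[[str, str], str],
) -> list[dict]:
    """Run every profile over every test input and keep one row per pair."""
    rows = []
    for profile_name, system_prompt in profiles.items():
        for case in eval_set:
            output = call_agent(system_prompt, case["input"])
            # Save the profile, the input, the output, and the source on each row.
            rows.append(
                {
                    "profile": profile_name,
                    "input": case["input"],
                    "output": output,
                    "source": case["source"],
                }
            )
    return rows
```

Every profile sees exactly the same inputs in the same order, which is what makes the later per-profile comparison fair.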
Score rubric and grounding independently
Run the rubric judge on every output. Run the grounding checker on those same outputs against their source text. Save both scores on each row.
Uses: LLM-as-Judge
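Scoring can be sketched as two independent calls per row, with both results stored side by side. Here `judge_quality` and `check_claims` are hypothetical callables (the first returns a numeric rubric score, the second one boolean per claim). Recording `None` when an answer contains no claims is an assumed convention.

```python
from typing import Callable, Optional


def score_rows(
    rows: list[dict],
    judge_quality: Callable[[str], float],
    check_claims: Callable[[str, str], list[bool]],
) -> list[dict]:
    """Attach a quality score and a grounding score to each row, computed separately."""
    scored = []
    for row in rows:
        quality = judge_quality(row["output"])
        verdicts = check_claims(row["output"], row["source"])
        # Grounding is the share of supported claims; undefined if there are none.
        grounding: Optional[float] = (
            sum(verdicts) / len(verdicts) if verdicts else None
        )
        scored.append({**row, "quality": quality, "grounding": grounding})
    return scored
```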
Aggregate per profile and pick the winner
Work out the average score and pass rate per profile on both axes. Reject any profile that wins on quality but loses on grounding. A smooth-sounding but unfaithful profile is a failure, not a winner.
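The roll-up and the grounding gate might look like the sketch below. The pass thresholds (`quality_pass`, `grounding_pass`) and the `grounding_floor` used to reject a profile are assumed values for illustration; the source does not give numbers. Rows are assumed to carry numeric `quality` and `grounding` fields.

```python
from collections import defaultdict
from statistics import mean
from typing import Optional


def aggregate(scored: list[dict], quality_pass: float = 4.0,
              grounding_pass: float = 0.9) -> dict[str, dict]:
    """Compute the mean and pass rate per profile, separately for each axis."""
    by_profile = defaultdict(list)
    for row in scored:
        by_profile[row["profile"]].append(row)
    summary = {}
    for name, rows in by_profile.items():
        q = [r["quality"] for r in rows]
        g = [r["grounding"] for r in rows]
        summary[name] = {
            "quality_mean": mean(q),
            "quality_pass_rate": sum(x >= quality_pass for x in q) / len(q),
            "grounding_mean": mean(g),
            "grounding_pass_rate": sum(x >= grounding_pass for x in g) / len(g),
        }
    return summary


def pick_winner(summary: dict[str, dict],
                grounding_floor: float = 0.9) -> Optional[str]:
    # Gate on grounding first, then rank the survivors by quality.
    eligible = {n: s for n, s in summary.items()
                if s["grounding_mean"] >= grounding_floor}
    if not eligible:
        return None
    return max(eligible, key=lambda n: eligible[n]["quality_mean"])
```

Because grounding acts as a gate rather than a weighted term, a profile with the top quality score can still be rejected, and the two axes are never averaged together.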
Spot-check the leader with humans
Have an expert read a sample of the winning profile's outputs. Confirm the quality and grounding scores match what a person would say before you promote the profile.
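The human spot-check can be supported by drawing a reproducible sample of the leader's rows and measuring how often the expert's pass/fail verdict matches the automated one. This is a sketch; the function names, the fixed seed, and the use of simple agreement rate as the measure are assumptions.

```python
import random


def sample_for_review(rows: list[dict], k: int, seed: int = 0) -> list[dict]:
    # A seeded sample, so the same rows can be pulled again for a re-review.
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def agreement_rate(auto_pass: list[bool], human_pass: list[bool]) -> float:
    """Share of sampled rows where the human verdict matches the automated one."""
    if not auto_pass or len(auto_pass) != len(human_pass):
        raise ValueError("verdict lists must be non-empty and the same length")
    matches = sum(a == h for a, h in zip(auto_pass, human_pass))
    return matches / len(auto_pass)
```

A low agreement rate suggests the rubric or the grounding checker needs revisiting before the profile is promoted.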
Principles
- Quality and grounding are separate axes. Never average them into one number.
- The whole profile is what you compare, not single prompt tweaks inside a profile.
- A grounding failure outweighs a quality win. A smooth but unfaithful answer is worse than a clumsy but faithful one.
- Picking a profile is a frozen-test event. Once chosen, the profile is locked until the next re-test.
Related patterns (4)
- ★ Agent Persona Profile: Treat agent identity as a structured profile object (persona, primary motivator, allowed actions, knowledge bindings) rather than a free-form role sentence in the system prompt.
- ★★ LLM-as-Judge: Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
- ★ Frozen Rubric Reflection: Constrain reflection to a fixed, hand-authored rubric of criteria so the reviewer cannot invent new ones each run.
- · Personality Variant Overlay: Let one agent speak in several named voices that overlay the base identity rather than replacing it, so the agent can shift register without losing identity continuity or splitting into separate personas.
Sources (2)
- AI Agents in Action, Ch. 9, §9.4 "Evaluating profiles: Rubrics and grounding" and §9.7 "Comparing profiles: Getting the perfect profile".
- AI Agents in Action, Chapter 9 (Manning liveBook), §9.4–§9.7: "9.4 Evaluating profiles: Rubrics and grounding", "9.5 Understanding rubrics and grounding", "9.6 Grounding evaluation with an LLM profile", "9.7 Comparing profiles: Getting the perfect profile".
Provenance
- Added to catalog:
- Last updated:
- Verification status: verified