Methodology · Evaluation · emerging · verified

Rubric and Grounding Profile Evaluation

Also known as: profile rubric eval, profile-grading harness

Applies to: agent, llm-app, rag-system

Tags: profile-selection, rubric, grounding, prompt-flow

Compare candidate agent profiles, the personas an agent can adopt, in a fair head-to-head. Run each profile over the same test set in one batch. Grade output quality with a model-judged rubric. Separately check grounding, that is, whether the answers stay tied to the source text. Both scores are computed in a batch harness, in the style of Azure Prompt Flow, then rolled up per profile, so the winning prompt or persona is chosen on numbers, not vibes. This differs from plain model-as-judge grading in two ways. First, the unit of comparison is the whole profile: the system prompt, persona, and instructions together. Second, grounding to the source is a full second axis alongside the quality rubric.

Methodology process overview

flowchart TD
    rubric_in[Rubric judge prompt] --> s1[Author rubric and grounding checks]
    ground_in[Grounding checker] --> s1
    s1 --> checks[(Frozen rubric + grounding checker)]
    profiles[Candidate profiles] --> s2[Enumerate profiles]
    s2 --> locked[Version-pinned profile slate]
    eval[Eval set with sources] --> s3[Batch-run per profile]
    locked --> s3
    s3 --> outputs[(profile, input, output, source) tuples]
    outputs --> s4a[Score rubric]
    outputs --> s4b[Score grounding]
    checks --> s4a
    checks --> s4b
    s4a --> s5[Aggregate per profile]
    s4b --> s5
    s5 --> winner{Wins rubric AND grounding?}
    winner -->|no| reject[Reject - fluent-but-unfaithful]
    winner -->|yes| s6[Human spot-check]
    s6 --> promote[Promote winning profile]

Intent. Pick the best agent profile from a set of candidates by scoring each one on a frozen quality rubric and a source-grounding check, run in batch.

When to apply. Use this when a team has several candidate system prompts, personas, or profile templates for the same agent role and must pick one. Examples: comparing tone variants for a support agent, or comparing instruction styles for an assistant grounded in policy docs. Don't apply it when there is only one candidate profile. Skip it too when the task has no source text, so grounding has nothing to check, as in pure creative writing.

Example scenario

A team was building an HR-policy assistant for a 12,000-person company. They had four candidate system prompts: a terse 'just answer the question' profile, a verbose 'cite chapter and verse' profile, a friendly conversational profile, and a strict legal-counsel-style profile. All four had to answer the same 220-question test, drawn from real employee submissions and grounded in the company's 380-page policy handbook.

The team ran the set in Azure Prompt Flow batch mode. Each profile answered every question, and the policy passages it retrieved were logged on each row. Claude Opus scored every answer against a frozen rubric covering helpfulness, tone fit for an internal HR setting, brevity, and format. Separately, a grounding checker took each answer and its retrieved passages, then flagged claims the passages did not support.

After 880 graded rows per axis (4 profiles × 220 questions), they had a clean leaderboard. The verbose profile won the rubric with a mean of 4.3/5 but came third on grounding, with 74% of claims supported: it filled gaps with plausible-sounding policy that was not actually in the handbook. The terse profile won grounding at 94% but bored reviewers. The legal-counsel profile balanced both, at 4.0/5 on the rubric and 93% on grounding, and was promoted after an HR lead spot-read 20 of its answers and confirmed the scores matched their own judgement. The verbose profile was rejected outright, with a note in the comparison report: smooth-sounding unfaithfulness is a failure, not a winner.

Inputs

Candidate profiles — Two or more system-prompt or persona definitions to compare.
Eval set — The inputs the agent must respond to. Ideally each comes with the source documents the answer should be based on.
Rubric prompt — Fixed instructions for the judge. They cover the quality points to score on each output.
Grounding checker — A prompt or routine that flags any part of an answer the source text does not back up.

Outputs

Per-profile rubric scores — The rolled-up quality scores for each candidate profile.
Per-profile grounding scores — The share of an answer's claims that can be traced back to the source text, per profile.
Profile comparison report — A side-by-side leaderboard of quality versus grounding for each candidate, with the winner picked.

Steps (6)

Author the rubric and grounding checks
Write a frozen model-judge rubric for the quality points: helpfulness, tone, brevity, and following the format. Separately, write a grounding checker. It takes an answer and its source, then marks each claim as supported or not.
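One way to make "frozen" checkable is to fingerprint the rubric text, so any later edit shows up as a changed hash. The sketch below is a minimal illustration under stated assumptions: `FrozenRubric`, `ClaimVerdict`, and the 12-character SHA-256 fingerprint are hypothetical names and choices, not something the methodology prescribes, and the judge prompt is a placeholder.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenRubric:
    """Hypothetical container for the judge rubric; names are illustrative."""
    criteria: tuple[str, ...]   # quality points the judge scores
    judge_prompt: str           # fixed instructions sent to the judge model

    @property
    def fingerprint(self) -> str:
        # Hash criteria and prompt together so any later edit changes the value.
        payload = "\n".join(self.criteria) + "\n---\n" + self.judge_prompt
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class ClaimVerdict:
    """One grounding decision: a claim from an answer and whether the source backs it."""
    claim: str
    supported: bool


rubric = FrozenRubric(
    criteria=("helpfulness", "tone", "brevity", "format"),
    judge_prompt="Score the answer from 1 to 5 on each criterion.",  # placeholder text
)
```

Recording `rubric.fingerprint` alongside every scored row would let a later reader confirm that all profiles were graded against the same rubric text.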
Enumerate candidate profiles
List the system prompts and persona variants to compare, each pinned to a version. Treat each profile as one sealed unit. Do not edit any of them mid-test.
Uses: Agent Persona Profile
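Version pinning can be sketched as a name, a version label, and a content hash of the system prompt, so a silent mid-test edit becomes visible. This is an assumption-laden illustration: the `Profile` record, the `pin` format, and `lock_slate` are hypothetical, and the two prompts are invented placeholders.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical profile record: one sealed unit under test."""
    name: str
    version: str
    system_prompt: str

    @property
    def pin(self) -> str:
        # name@version plus a short content hash of the prompt text.
        digest = hashlib.sha256(self.system_prompt.encode("utf-8")).hexdigest()[:8]
        return f"{self.name}@{self.version}#{digest}"


def lock_slate(profiles: list[Profile]) -> dict[str, Profile]:
    """Return the slate keyed by pin; refuse slates that cannot be compared."""
    if len(profiles) < 2:
        raise ValueError("need at least two candidate profiles to compare")
    names = [p.name for p in profiles]
    if len(set(names)) != len(names):
        raise ValueError("each profile name may appear only once in a slate")
    return {p.pin: p for p in profiles}


slate = lock_slate([
    Profile("terse", "v1", "Just answer the question."),
    Profile("legal-counsel", "v1", "Answer strictly from the cited policy text."),
])
```

Because `Profile` is a frozen dataclass, changing a prompt means creating a new record with a new pin, which keeps the "do not edit mid-test" rule mechanical rather than a matter of discipline.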
Batch-run the eval set per profile
For each profile, run the agent over every test input in a batch harness, such as Azure Prompt Flow or similar. Save every row: the profile, the input, the output, and the source.
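The batch run itself reduces to a nested loop that records one row per (profile, input) pair. The sketch below is harness-agnostic and does not use any Azure Prompt Flow API; the `Agent` callable signature, `batch_run`, and `echo_agent` are assumptions made for illustration, with `echo_agent` standing in for a real model call.

```python
from typing import Callable, Iterable

# Hypothetical agent signature: (system_prompt, question, source) -> answer text.
Agent = Callable[[str, str, str], str]


def batch_run(
    profiles: dict[str, str],             # profile name -> system prompt
    eval_set: Iterable[tuple[str, str]],  # (input, source) pairs
    agent: Agent,
) -> list[dict[str, str]]:
    """Run every profile over every eval item and keep one row per pair."""
    items = list(eval_set)  # materialise once so every profile sees the same items
    rows = []
    for name, system_prompt in profiles.items():
        for question, source in items:
            rows.append({
                "profile": name,
                "input": question,
                "output": agent(system_prompt, question, source),
                "source": source,
            })
    return rows


def echo_agent(system_prompt: str, question: str, source: str) -> str:
    # Stand-in for a real model call; it just repeats the source text.
    return source


rows = batch_run(
    {"terse": "Just answer.", "verbose": "Cite chapter and verse."},
    [("How many leave days?", "Staff get 25 leave days."),
     ("Can I carry days over?", "Up to 5 days carry over.")],
    echo_agent,
)
```

Keeping the source on every row is what lets the grounding checker run later without re-querying the retrieval step.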
Score rubric and grounding independently
Run the rubric judge on every output. Run the grounding checker on those same outputs against their source text. Save both scores on each row.
Uses: LLM-as-Judge
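Independent scoring can be sketched as two separate calls per row whose results are stored in separate fields and never combined. Everything named here is hypothetical: `score_rows`, the two callable types, and the toy judge and checker, which are deliberately crude stand-ins for a model judge and a real claim-level grounding check.

```python
from typing import Callable

# Hypothetical callables standing in for the model judge and the grounding checker.
RubricJudge = Callable[[str, str], float]            # (input, output) -> score on a 1-5 scale
GroundingChecker = Callable[[str, str], list[bool]]  # (output, source) -> one verdict per claim


def score_rows(rows, judge: RubricJudge, checker: GroundingChecker):
    """Attach a rubric score and a grounding rate to each row, computed separately."""
    scored = []
    for row in rows:
        verdicts = checker(row["output"], row["source"])
        scored.append({
            **row,
            "rubric": judge(row["input"], row["output"]),
            # Share of claims the source supports; None when the answer makes no claims.
            "grounding": sum(verdicts) / len(verdicts) if verdicts else None,
        })
    return scored


def toy_judge(question: str, answer: str) -> float:
    # Stand-in for a model judge: shorter answers score higher.
    return 5.0 if len(answer) < 40 else 3.0


def toy_checker(answer: str, source: str) -> list[bool]:
    # Stand-in for a real checker: treats each sentence as a claim and counts it
    # as supported only if it appears verbatim in the source.
    claims = [c.strip() for c in answer.split(".") if c.strip()]
    return [c in source for c in claims]


scored = score_rows(
    [{"profile": "verbose", "input": "Leave days?",
      "output": "Staff get 25 leave days. Unused days are paid out.",
      "source": "Staff get 25 leave days."}],
    toy_judge,
    toy_checker,
)
```

In this toy row the second sentence has no support in the source, so the grounding rate is 0.5 while the rubric score is computed without reference to it.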
Aggregate per profile and pick the winner
Work out the average score and pass rate per profile on both axes. Reject any profile that wins on quality but loses on grounding. A smooth-sounding but unfaithful profile is a failure, not a winner.
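The text does not define exactly what "loses on grounding" means, so the sketch below assumes one possible rule: drop every profile whose mean grounding rate falls under a floor, then take the best rubric mean among the rest. The 0.90 floor, the function names, and the toy rows are all invented for illustration; only the mean is computed here, and a pass rate could be added the same way.

```python
from collections import defaultdict
from statistics import mean


def aggregate(scored_rows):
    """Mean rubric score and mean grounding rate per profile, kept as two numbers."""
    by_profile = defaultdict(list)
    for row in scored_rows:
        by_profile[row["profile"]].append(row)
    return {
        name: {
            "rubric_mean": mean(r["rubric"] for r in rows),
            "grounding_mean": mean(r["grounding"] for r in rows),
        }
        for name, rows in by_profile.items()
    }


def pick_winner(summary, grounding_floor=0.90):
    """Assumed rule: drop profiles under the grounding floor, then take the best rubric mean."""
    eligible = {n: s for n, s in summary.items() if s["grounding_mean"] >= grounding_floor}
    rejected = sorted(set(summary) - set(eligible))
    winner = max(eligible, key=lambda n: eligible[n]["rubric_mean"]) if eligible else None
    return winner, rejected


# Toy rows, invented for illustration only.
toy_rows = [
    {"profile": "fluent", "rubric": 5.0, "grounding": 0.5},
    {"profile": "fluent", "rubric": 4.0, "grounding": 1.0},
    {"profile": "careful", "rubric": 4.0, "grounding": 1.0},
    {"profile": "careful", "rubric": 3.0, "grounding": 1.0},
]
summary = aggregate(toy_rows)
winner, rejected = pick_winner(summary)
```

Here the "fluent" profile has the higher rubric mean (4.5 against 3.5) but a grounding mean of 0.75, so it is rejected and "careful" wins. Averaging per-row grounding rates is itself a choice; pooling all claims per profile before dividing is an equally reasonable alternative.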
Spot-check the leader with humans
Have an expert read a sample of the winning profile's outputs. Confirm the quality and grounding scores match what a person would say before you promote the profile.
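A spot-check can be sketched as a reproducible random sample plus an agreement rate between the automated verdicts and the reviewer's. The helper names, the fixed seed, and the five example verdict pairs are hypothetical; the sample size of 20 out of 220 rows simply mirrors the example scenario.

```python
import random


def sample_for_review(rows, k=20, seed=0):
    """Draw a reproducible random sample of the leader's rows for a human reviewer."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn later
    return rng.sample(rows, min(k, len(rows)))


def agreement_rate(pairs):
    """Share of sampled rows where the human verdict matches the automated one."""
    return sum(1 for auto, human in pairs if auto == human) / len(pairs)


leader_rows = [{"id": i} for i in range(220)]
sample = sample_for_review(leader_rows, k=20)

# Hypothetical (automated_pass, human_pass) verdicts for five reviewed rows.
rate = agreement_rate([
    (True, True), (True, True), (False, False), (True, False), (True, True),
])
```

What agreement rate is high enough to promote is a team decision that the methodology leaves open.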

Principles

Quality and grounding are separate axes. Never average them into one number.
The whole profile is what you compare, not single prompt tweaks inside a profile.
A grounding failure outweighs a quality win. A smooth but unfaithful answer is worse than a clumsy but faithful one.
Picking a profile is a frozen-test event. Once chosen, the profile is locked until the next re-test.

Related compositions (1)

recipe · abstract shape
Eval & Observability
How you keep an agent honest in production: harness, judge, decision log, provenance, shadow rollouts.

Provenance

Added to catalog: 2026-05-24
Last updated: 2026-05-27
Verification status: verified
