Blind Grader with Isolated Context
also known as Fresh-Eyes Evaluator, Trace-Blind Judge, Outcomes-Style Verification, Context-Isolated Grader
Run an evaluator in a separately-allocated context window with access only to the artifact and the rubric, never the producing agent's reasoning trace, so the grader cannot be primed by the producer's framing.
This pattern helps complete certain larger patterns —
- specialisesLLM-as-Judge★★— Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
Context
A team builds an agent workflow in which a producer agent runs a long chain of reasoning and tool calls to construct some artefact (a plan, a patch, a written answer, a sequence of tool calls) and then a downstream evaluator is asked to judge whether the artefact is correct. The natural implementation hands the evaluator the producer's full reasoning trace alongside the artefact, on the assumption that more context produces a better judgement. The evaluator may be a separate prompt or even a separate model.
Problem
When the evaluator can see the producer's full reasoning trace, it tends to inherit the producer's framing and rationalise the artefact rather than evaluate it on its own merits. The producer's chain of thought makes mistaken choices look deliberate, and the evaluator ends up agreeing with the very priming that caused the mistake. The errors a fresh, uninformed reader would notice immediately are exactly the ones the trace-aware evaluator misses. Routing to a different model family is expensive and does not reliably break the priming, because the framing leaks through the trace itself rather than through any shared weights.
Forces
- Reasoning traces carry useful context but also carry priming that biases evaluation.
- Some failures are only visible from outside the producer's framing.
- Fully retraining or routing to a different model is expensive and may not actually break the priming.
- Rubrics must be precise enough to apply without the producer's reasoning as context.
- Logs and trajectories must still be auditable, even if the grader does not see them.
Example
A coding agent produces a fix for a flaky integration test. A naive critic reading the producer's reasoning agrees the fix is sound. The team instead routes the patch to a blind grader: a fresh context window containing only the patch diff and a rubric asking 'does this change the test's intent?' and 'does it suppress the underlying race?'. The blind grader flags that the patch widens a timeout and suppresses the race instead of fixing it — a verdict the trace-aware critic missed because the producer's reasoning made the widening sound deliberate.
Diagram
Solution
Therefore:
When the producer finishes, the orchestrator allocates a new context window (a new conversation, a new agent invocation, a new prompt instance) and constructs a grader call that contains only the artefact and the rubric. The producing agent's reasoning chain, scratchpad, and prior turns are deliberately excluded. The grader is instructed to judge against the rubric on its own terms and to flag what is missing or wrong. The grader's output is logged against the artefact and against the producer's trace for audit, but the grader itself was blind to the trace at decision time. The same model may be used as both producer and grader — context isolation is the load-bearing element, not a different model.
What this pattern forbids. The grader's context window must contain only the artefact, the rubric, and grader instructions; the producing agent's reasoning trace, scratchpad, prior turns, and tool-call history must be excluded; summaries of the producer's reasoning must not be injected into the grader context.
And the patterns that stand alongside it, or against it —
- alternative-toAgent-as-a-Judge★— Evaluate an agent's full trajectory (steps, tool calls, intermediate states) by another agent rather than scoring only the final output.
- alternative-toSame-Model Self-Critique✕— Anti-pattern: have the same model both produce an answer and critique it, expecting independence.
- complementsEvaluator-Optimizer★★— One LLM generates; another evaluates and feeds back; loop until criteria are met.
- complementsFrozen Rubric Reflection★— Constrain reflection to a fixed, hand-authored rubric of criteria so the reviewer cannot invent new ones each run.
- alternative-toSandbagging✕— Anti-pattern: rely on evaluation suites that probe model capability assuming the model is trying its best.
- alternative-toAlignment Faking✕— Anti-pattern: assume the agent behaves the same whether it believes it is being evaluated or not, and trust eval scores to predict deployment behaviour.
- complementsSimulate Before Actuate★— Before issuing an irreversible action, run a deterministic simulation that computes pre-conditions, invariants, and expected deltas; require a verifier — automated or human — to green-light the simulated outcome before the real command is sent.
Neighbourhood
Click any neighbour to follow the language. Scroll to zoom, drag to pan.