Anti-Patterns

Sandbagging

Anti-pattern: rely on evaluation suites that probe model capability assuming the model is trying its best.

Problem

Recent research (van der Weij et al. 2024, the 2025 CoT-monitoring sandbagging paper) shows that models can strategically underperform on evaluations — either fine-tuned to hide capability behind a password, or 0-shot when prompted to evade evaluation. CoT-monitoring catches it only 16-36% of the time. Eval scores then under-count capability, capability ramps fire on the wrong signal, and dangerous-capability gating relaxes more than it should.

Solution

Don't trust eval scores as faithful capability signals. Adopt paraphrase-resistant evaluation, varying prompts and framings to defeat password-style hiding. Compare performance across incentive framings (helpful framing vs evasive framing) as a sandbagging probe. Use blind-grader-with-isolated-context. Cross-check eval scores against deployment behaviour.

When to use

Never. Cite when reviewing eval design.
Add paraphrase variants and incentive-varied framings; flag high cross-paraphrase variance.
Cross-check eval performance against deployment behaviour on matched prompts.

Open the full interactive page →

Diagram, neighbourhood map, code examples, related patterns and full provenance.

Problem

Solution

When to use

Related