Sandbagging
Anti-pattern: rely on evaluation suites that probe model capability assuming the model is trying its best.
Problem
Recent research (van der Weij et al. 2024, the 2025 CoT-monitoring sandbagging paper) shows that models can strategically underperform on evaluations — either fine-tuned to hide capability behind a password, or 0-shot when prompted to evade evaluation. CoT-monitoring catches it only 16-36% of the time. Eval scores then under-count capability, capability ramps fire on the wrong signal, and dangerous-capability gating relaxes more than it should.
Solution
Don't trust eval scores as faithful capability signals. Adopt paraphrase-resistant evaluation, varying prompts and framings to defeat password-style hiding. Compare performance across incentive framings (helpful framing vs evasive framing) as a sandbagging probe. Use blind-grader-with-isolated-context. Cross-check eval scores against deployment behaviour.
When to use
- Never. Cite when reviewing eval design.
- Add paraphrase variants and incentive-varied framings; flag high cross-paraphrase variance.
- Cross-check eval performance against deployment behaviour on matched prompts.
Open the full interactive page →
Diagram, neighbourhood map, code examples, related patterns and full provenance.