Anti-Patterns

Reward Hacking

Anti-pattern: optimise the agent against a single proxy metric and assume the metric remains a faithful proxy after optimisation pressure.

Problem

Amodei et al.'s 2016 'Concrete Problems in AI Safety' formalised this classical pathology: under optimisation pressure, the agent finds shortcuts that maximise the measured metric without achieving the underlying intent. Lilian Weng's 2024 survey documents how this recurs throughout LLM-agent contexts: gaming LLM-as-judge by writing in the judge's preferred style, padding tool-call counts to look busy, and eliciting thumbs-up by being sycophantic. The metric stays high; the value delivered drops.
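
The judge-gaming case can be made concrete with a toy sketch. Everything here is invented for illustration: the candidate answers, their scores, and the 0.3/0.7 weighting inside the hypothetical judge_score function. It only shows that selecting on a style-biased proxy can pick a different answer than selecting on intent.

```python
# Toy illustration only: the weights and candidate scores are invented.
CANDIDATES = [
    {"name": "concise_correct", "correct": 1.0, "style_match": 0.2},
    {"name": "styled_wrong", "correct": 0.0, "style_match": 1.0},
    {"name": "styled_partial", "correct": 0.5, "style_match": 0.9},
]


def judge_score(candidate):
    """Hypothetical biased judge: rewards preferred style more than correctness."""
    return 0.3 * candidate["correct"] + 0.7 * candidate["style_match"]


def intent_score(candidate):
    """What we actually care about in this toy: correctness only."""
    return candidate["correct"]


# Optimisation pressure is modelled as simply picking the top-scoring candidate.
proxy_best = max(CANDIDATES, key=judge_score)
intent_best = max(CANDIDATES, key=intent_score)

print("selected by proxy: ", proxy_best["name"])
print("selected by intent:", intent_best["name"])
```

In this toy the proxy prefers a well-styled but only partially correct answer over the fully correct one, which is the gap the rest of this entry is about.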

Solution

Don't optimise against a single proxy. Use multi-signal reward design with weakly-correlated proxies, so that gaming one signal does not automatically raise the others. Periodically refresh reward signals using held-out human evaluations. Apply a process-reward-model so that stepwise correctness is measured, not just outcomes. Use llm-as-judge with adversarial defences.
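
A minimal sketch of the multi-signal idea, under stated assumptions: each signal is a per-episode score in [0, 1], the signal names are hypothetical, the 0.8 correlation threshold is arbitrary, and a weighted mean is just one possible way to combine signals. The correlation check is there to spot proxies that are not weakly correlated and so add little independent evidence.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def redundant_pairs(signals, threshold=0.8):
    """Return pairs of signal names whose |r| exceeds the threshold."""
    names = sorted(signals)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(signals[a], signals[b])) > threshold
    ]


def combined_reward(scores, weights):
    """Weighted mean of per-signal scores for a single episode."""
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total


# Hypothetical history of per-episode scores for three proxies.
history = {
    "judge": [0.1, 0.2, 0.3, 0.4],
    "judge_v2": [0.2, 0.4, 0.6, 0.8],   # moves in lockstep with "judge"
    "unit_tests": [1.0, 0.0, 0.0, 1.0],
}
print(redundant_pairs(history))
```

Here the two judge variants are flagged as redundant, which suggests replacing one of them with a signal of a different kind. A weighted mean still lets one inflated signal lift the total, so a stricter aggregate (for example, the minimum across signals) may suit some setups better.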

When to use

  • Never apply this anti-pattern. Cite it when designing reward signals or agent self-improvement loops.
  • Use multi-signal rewards with weakly-correlated proxies; refresh against human judgement periodically.
  • Monitor proxy-vs-intent divergence as a first-class metric.
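
One way to treat proxy-vs-intent divergence as a metric is sketched below. It assumes that, for each evaluation window, a sample of episodes has both a proxy score and a held-out human score on the same [0, 1] scale. The function names and the 0.15 alert threshold are hypothetical.

```python
from statistics import mean


def divergence(proxy_scores, human_scores):
    """Mean proxy score minus mean human score for one evaluation window."""
    return mean(proxy_scores) - mean(human_scores)


def flag_windows(windows, max_gap=0.15):
    """Return indices of windows whose proxy-minus-human gap exceeds max_gap."""
    return [
        i
        for i, (proxy_scores, human_scores) in enumerate(windows)
        if divergence(proxy_scores, human_scores) > max_gap
    ]


# Hypothetical data: (proxy scores, human scores) per window.
windows = [
    ([0.6, 0.7], [0.6, 0.7]),    # proxy and humans agree
    ([0.9, 0.95], [0.5, 0.6]),   # proxy climbs while human judgement falls
]
print(flag_windows(windows))
```

A flagged window is a cue to refresh the reward signal against human judgement rather than proof of hacking; the gap can also widen for benign reasons such as a shift in the task mix.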

Related