Missing max_tokens Cap
Anti-pattern: call the model without an explicit max_tokens (or equivalent) so a single call can drain the run's budget on a runaway generation.
Problem
A single hallucinated loop in the output (the model rambling, repeating, or generating filler) consumes the full context budget on one call. This dominates the run cost. Worse, a slow generation locks up the agent thread for tens of seconds. Distinct from step-budget (which caps total agent steps) and cost-gating (which caps total spend) — this is the per-call output cap.
Solution
Set max_tokens per call site based on output schema. For structured-output schemas, derive the cap from the schema. For prose, use task-class defaults. Alert on cap-hit rate as a quality signal (it indicates undersized cap OR runaway generation). Pair with structured-output and step-budget.
When to use
- Never. Cite when reviewing model call sites.
- Set explicit max_tokens per call matched to expected output.
- Alert on cap-hit rate as a quality signal.
Open the full interactive page →
Diagram, neighbourhood map, code examples, related patterns and full provenance.