Missing max_tokens Cap

also known as Unbounded Output Cap, No Output Budget

Anti-pattern: call the model without an explicit max_tokens (or equivalent) so a single call can drain the run's budget on a runaway generation.

Context

An agent calls a model that supports a max_tokens parameter (or the SDK exposes one). The call site omits the parameter or sets it to the model's max, on the reasoning that 'the agent wants full answers'.

Problem

A single runaway generation (the model rambling, repeating itself, or producing filler) can run all the way to the model's maximum output length in one call, and that one call then dominates the cost of the run. Worse, a slow generation blocks the agent thread for tens of seconds. This is distinct from step-budget (which caps total agent steps) and cost-gating (which caps total spend): the concern here is the per-call output cap.
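To make the distinction between the three limits concrete, here is a minimal sketch of a toy agent loop. Everything in it is hypothetical: `Limits`, `run_agent`, and `rambling_model` are illustrative names, output tokens are modelled as list items, and a total token budget stands in for cost gating.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Limits:
    max_tokens_per_call: int  # per-call output cap (the subject of this entry)
    max_steps: int            # step budget: total agent steps
    max_total_tokens: int     # simplified stand-in for cost gating: total spend


def run_agent(model: Callable[[str, int], List[str]], prompt: str, limits: Limits) -> dict:
    """Run a toy loop; `model(prompt, max_tokens)` returns a list of output tokens."""
    spent = 0
    steps = 0
    while steps < limits.max_steps and spent < limits.max_total_tokens:
        # Never request more output than the run has left.
        cap = min(limits.max_tokens_per_call, limits.max_total_tokens - spent)
        out = model(prompt, cap)
        spent += len(out)
        steps += 1
        if len(out) < cap:
            break  # the model stopped on its own before hitting the cap
    return {"steps": steps, "spent": spent}


def rambling_model(prompt: str, max_tokens: int) -> List[str]:
    """Toy runaway generation: always fills whatever cap it is given."""
    return ["filler"] * max_tokens


capped = run_agent(rambling_model, "summarize", Limits(200, 5, 10_000))
uncapped = run_agent(rambling_model, "summarize", Limits(10_000, 5, 10_000))
```

In this toy setup the capped run uses all five steps but spends only 1,000 tokens, while the run whose per-call cap equals the run budget spends all 10,000 tokens on its first call. The step budget and the total budget are both still in place in the second run; neither limits the damage of a single call.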

Forces

max_tokens defaults vary per SDK; some require explicit setting.
Engineers underestimate how much a single call can over-produce when the prompt is even slightly off.
Capping output too aggressively truncates legitimate answers.
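The last force, truncation of legitimate answers, is only manageable if truncation is detected rather than silently passed downstream. The sketch below assumes the provider reports why generation stopped: OpenAI's Chat Completions API uses `finish_reason == "length"` and Anthropic's Messages API uses `stop_reason == "max_tokens"` for a cap hit. Check the documentation of the SDK in use, since other providers use other values. The helper names `hit_cap` and `accept_or_flag` are hypothetical.

```python
from typing import Optional

# Stop-reason values that mean "generation ended because the output cap was reached".
# "length"     : OpenAI Chat Completions finish_reason
# "max_tokens" : Anthropic Messages stop_reason
# ASSUMPTION: extend this set for any other provider you call.
CAP_HIT_REASONS = {"length", "max_tokens"}


def hit_cap(stop_reason: Optional[str]) -> bool:
    """Return True when the call ended because it ran into its output cap."""
    return stop_reason in CAP_HIT_REASONS


def accept_or_flag(text: str, stop_reason: Optional[str]) -> dict:
    """Wrap a model result so callers cannot mistake a truncated answer for a full one."""
    return {"text": text, "truncated": hit_cap(stop_reason)}


result = accept_or_flag("The report covers three", "length")
```

A caller can then decide per task whether a truncated result is retried with a larger cap, rejected, or accepted as is.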

Example

A summarization agent calls the model without max_tokens. A malformed prompt makes the model produce a 50,000-token rambling answer. One request costs more than the previous day's traffic. The problem is discovered only when the model gateway flags the call as anomalous.
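A rough back-of-the-envelope version of this example, counting output tokens only. The 50,000-token figure comes from the example; the typical summary length and the cap are assumptions chosen purely for illustration, not measurements.

```python
def equivalent_normal_calls(output_tokens: int, typical_tokens: int) -> float:
    """How many typical calls' worth of output a single call produced."""
    return output_tokens / typical_tokens


RUNAWAY_TOKENS = 50_000  # figure from the example
TYPICAL_TOKENS = 300     # ASSUMPTION: length of a normal summary
CAP = 1_024              # ASSUMPTION: a per-call max_tokens for this call site

uncapped = equivalent_normal_calls(RUNAWAY_TOKENS, TYPICAL_TOKENS)
capped = equivalent_normal_calls(min(RUNAWAY_TOKENS, CAP), TYPICAL_TOKENS)

print(f"uncapped runaway = {uncapped:.0f} typical calls")
print(f"capped runaway   = {capped:.1f} typical calls")
```

Under these assumptions the uncapped call emits as much output as about 167 normal summaries, while the same runaway with a cap in place is bounded at about 3.4.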

Diagram

flowchart TD
    Call[Model call, no max_tokens] --> Run[Model rambles 50k tokens]
    Run --> Cost[Single call exceeds daily budget]
    Run --> Latency[30+ second response]
    classDef bad fill:#fee,stroke:#c33;
    class Call,Run,Cost,Latency bad;

Solution

Therefore:

Set max_tokens per call site based on output schema. For structured-output schemas, derive the cap from the schema. For prose, use task-class defaults. Alert on cap-hit rate as a quality signal (it indicates undersized cap OR runaway generation). Pair with structured-output and step-budget.
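One way the cap selection and the cap-hit alert could look, as a minimal sketch rather than a reference implementation. All names (`TASK_CLASS_DEFAULTS`, `cap_from_schema`, `cap_for`, `CapHitMonitor`) and all numbers (the defaults, the per-field overhead, the 25% margin, the 5% alert rate) are assumptions to be tuned against observed output lengths.

```python
import math
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Optional

# ASSUMPTION: illustrative task-class defaults for prose outputs.
TASK_CLASS_DEFAULTS: Dict[str, int] = {
    "classification": 16,
    "summary": 512,
    "long_form": 2048,
}


def cap_from_schema(field_token_limits: Dict[str, int],
                    overhead_per_field: int = 8,
                    margin: float = 1.25) -> int:
    """Derive a cap from per-field token limits, plus key/punctuation overhead and a margin."""
    raw = sum(limit + overhead_per_field for limit in field_token_limits.values())
    return math.ceil(raw * margin)


def cap_for(task_class: str, field_token_limits: Optional[Dict[str, int]] = None) -> int:
    """Pick the cap for a call site: schema-derived if a schema exists, else the task default."""
    if field_token_limits is not None:
        return cap_from_schema(field_token_limits)
    # A KeyError here means a call site has no cap decided yet, which is the point.
    return TASK_CLASS_DEFAULTS[task_class]


@dataclass
class CapHitMonitor:
    """Track the cap-hit rate over a sliding window of recent calls."""
    window: int = 200
    alert_rate: float = 0.05
    _hits: Deque[bool] = field(init=False)

    def __post_init__(self) -> None:
        self._hits = deque(maxlen=self.window)

    def record(self, hit_cap: bool) -> None:
        self._hits.append(hit_cap)

    def rate(self) -> float:
        return sum(self._hits) / len(self._hits) if self._hits else 0.0

    def should_alert(self) -> bool:
        # A high rate means the cap is undersized OR generations are running away.
        return self.rate() > self.alert_rate


schema_cap = cap_for("summary", {"title": 20, "summary": 200})
prose_cap = cap_for("summary")
```

Keeping one monitor per call site, rather than one global counter, lets an alert point at the specific prompt or schema whose cap needs revisiting.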

What this pattern forbids. As an anti-pattern it imposes no useful constraint; the constraint it lacks is a per-call output cap matched to the expected output shape.

The patterns that counter or replace it:

complements Step Budget ★★: Cap the number of tool calls or loop iterations the agent is allowed within a single request.
complements Cost Gating ★★: Block actions whose expected cost exceeds a threshold without explicit user (or operator) acknowledgement.
complements Structured Output ★★: Constrain the model's output to conform to a JSON Schema (or similar typed shape).
complements Token-Economy Blindness ✕: Anti-pattern: operate multi-agent loops with no per-run token budget or alarm, allowing recursive loops to silently accumulate $10k+ in undetected costs.
complements Unbounded Loop ✕: Anti-pattern: run the agent loop without a step budget and let model self-termination decide.

References

LLM APIコスト削減の落とし穴 ("Pitfalls of LLM API Cost Reduction"), blog

Provenance

Source: patterns/missing-max-tokens-cap.md on GitHub · commit 0f962e5
Added to catalog: 2026-05-23
Last updated: 2026-05-23
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.