Tool-Output Arithmetic Trust

also known as Tool Output Processing Failure, In-Head Aggregation Over Tool Data

Anti-pattern: the agent compares, ranks, or sums correctly returned tool data in its own head instead of offloading the computation to a deterministic tool, and so emits confidently wrong aggregates.

Context

An agent gathers data through tools — search hits with scores, rows of prices, durations, line items, or counts — and then has to combine those values to answer the user. The tool returns the data correctly; the remaining work is ordinary computation over it, such as finding the cheapest option, ranking results, summing a column, or comparing two totals. Because the data is already in the context window, treating the next step as more free-form text feels natural to the model.

Problem

Token-by-token generation is not arithmetic. When the model performs comparison, ranking, or addition over tool data inside its own reasoning rather than in a deterministic step, it produces answers that read as authoritative but are numerically wrong: a mis-sorted ranking, a total that is off, a wrong cheapest pick. The data was right and the tool was right, so nothing in the trace flags the error, and the confident wrong aggregate flows straight to the user or into the next decision.

Forces

The tool already returned the values into context, so adding a separate compute step feels redundant, even though free-form generation is unreliable at exact arithmetic.
Small inputs (a handful of rows) look easy enough to do in-head, but the error rate rises silently with the number of items and the depth of the comparison.
A wrong aggregate is indistinguishable in tone from a right one; there is no refusal or error to catch it, so the failure is silent.
Forcing every comparison through a deterministic tool adds a call and a round-trip the agent would rather skip.

Example

A travel agent calls a flights tool that correctly returns five fares. Asked for the cheapest, the agent eyeballs the list in its reasoning and confidently names a flight that is not actually the lowest fare. The tool was right and the data was right, so nothing flags the mistake, and the user books a pricier flight than they were told was cheapest.
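The fix for this example can be sketched in a few lines of Python. This is a minimal illustration, not the catalog's reference implementation: the flight codes, prices, and the `cheapest_fare` helper are hypothetical, and the fares stand in for whatever the flights tool actually returns.

```python
from decimal import Decimal

# Hypothetical tool output: flight codes and prices are invented for illustration.
fares = [
    {"flight": "AA101", "price": Decimal("412.50")},
    {"flight": "BB202", "price": Decimal("389.99")},
    {"flight": "CC303", "price": Decimal("398.00")},
    {"flight": "DD404", "price": Decimal("389.90")},
    {"flight": "EE505", "price": Decimal("405.10")},
]


def cheapest_fare(rows):
    """Return the row with the lowest price; ties resolve to the first such row."""
    if not rows:
        raise ValueError("no fares to compare")
    return min(rows, key=lambda row: row["price"])


best = cheapest_fare(fares)
print(best["flight"], best["price"])  # DD404 389.90
```

Two of the invented fares differ by only nine cents, which is the kind of near-tie that is easy to misread by eye and trivial for `min` to get right.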

Diagram

flowchart TD
    T[Tool returns correct data] --> Q{How is the aggregate computed?}
    Q -->|ANTI: model computes in-head| H[Free-form generation over values]
    H --> W[Confident wrong total / ranking]
    Q -->|FIX: offload| D[Deterministic tool: calculator / sandbox / sort / query]
    D --> R[Read back computed result]
    R --> A[Verified aggregate reported]

Solution

Therefore:

The corrected stance is to route every aggregate over tool data through a deterministic step rather than the model's free-form output. After a tool returns rows, the agent passes them to a calculator, a code-execution sandbox, a sort or filter primitive, or a query, and reads back the computed result; the model's job is to choose what to compute and how to phrase the answer, not to be the adder or the comparator. Where a single deterministic step is impractical, the aggregate is at least recomputed and cross-checked before it is reported, so a numeric claim never rests solely on token generation.
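One way to picture the deterministic step is a small aggregate tool that the agent calls with the rows it received, the field to operate on, and the operation it has chosen. The sketch below is an assumption about what such a tool could look like: the `aggregate` function, its `op` names, and the sample line items are all hypothetical.

```python
from decimal import Decimal


def aggregate(rows, field, op, descending=False):
    """Deterministically compute one aggregate over tool rows.

    Hypothetical tool interface: `op` is one of "sum", "min", "max", "rank".
    """
    if not rows:
        raise ValueError("no rows to aggregate")
    # Parse via str so float inputs such as 0.1 do not carry binary rounding error.
    values = [Decimal(str(row[field])) for row in rows]
    if op == "sum":
        return sum(values, Decimal("0"))
    if op == "min":
        return rows[values.index(min(values))]
    if op == "max":
        return rows[values.index(max(values))]
    if op == "rank":
        order = sorted(range(len(rows)), key=lambda i: values[i], reverse=descending)
        return [rows[i] for i in order]
    raise ValueError(f"unsupported operation: {op}")


# Hypothetical rows, standing in for data a tool has already returned.
line_items = [
    {"sku": "A", "amount": "19.99"},
    {"sku": "B", "amount": "5.01"},
    {"sku": "C", "amount": "120.00"},
]
total = aggregate(line_items, "amount", "sum")
ranked = aggregate(line_items, "amount", "rank", descending=True)
print(total)                       # 145.00
print([r["sku"] for r in ranked])  # ['C', 'A', 'B']
```

The model still decides which field and which operation answer the user's question; the numbers it then quotes come from the return value rather than from generation.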

What this pattern forbids. Avoiding the anti-pattern means the agent never computes aggregates over tool data itself: comparison, ranking, and arithmetic are offloaded to a deterministic tool, and a numeric claim is reported only after it has been read back from that computation.
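The read-back rule can be enforced with a guard that recomputes the value and refuses to release a claim that disagrees with it. The following is a sketch under stated assumptions: `verified_total`, `UnverifiedAggregateError`, and the sample durations are hypothetical names and data, and a real system might compare rankings or minima the same way.

```python
from decimal import Decimal


class UnverifiedAggregateError(Exception):
    """Raised when a claimed number does not match the recomputed value."""


def verified_total(claimed, values, tolerance=Decimal("0")):
    """Recompute the sum of `values` and return it only if `claimed` matches."""
    recomputed = sum((Decimal(str(v)) for v in values), Decimal("0"))
    if abs(Decimal(str(claimed)) - recomputed) > tolerance:
        raise UnverifiedAggregateError(f"claimed {claimed}, recomputed {recomputed}")
    return recomputed


# Hypothetical durations in minutes, standing in for correctly returned tool data.
durations_min = [45, 30, 75, 20]

print(verified_total(170, durations_min))  # 170

try:
    verified_total(160, durations_min)  # a plausible-looking but wrong in-head total
except UnverifiedAggregateError as err:
    print("blocked:", err)  # blocked: claimed 160, recomputed 170
```

The guard returns the recomputed value, not the claimed one, so whatever reaches the user has come from the deterministic path.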

The patterns that counter or replace it:

alternative-to: Code Execution ★★ - Let the model emit code, run it in a sandbox, and treat the run as the answer instead of trusting the model to compute in its head.
alternative-to: MRKL Systems (Modular Neuro-Symbolic) ★★ - Route each request through an LLM dispatcher to specialized symbolic or neural expert modules (calculator, knowledge base, code executor) rather than asking one LLM to do everything; integrate the modules' results for the final response.
complements: Tool Output Trusted Verbatim ✕ - Anti-pattern: trust whatever tools return without validation, schema enforcement, or trust labels.
complements: Premature Closure ✕ - The LLM commits to a confident answer before processing all constraints, characteristic of constraint-heavy tasks where it fills in plausible answers fast and gets cross-constraint interactions wrong.
complements: False Confidence Syndrome ✕ - Anti-pattern: the model produces incorrect answers with the same high confidence as correct ones, failing to vary its expressed certainty with its actual reliability (Oxford-documented for constraint-heavy prompts).

Provenance

Source: patterns/tool-output-arithmetic-trust.md on GitHub · commit ad426c4
Added to catalog: 2026-06-14
Last updated: 2026-06-14
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.