Tool Use & Environment

Code Execution

Let the model emit code, run it in a sandbox, and treat the run as the answer instead of trusting the model to compute in its head.

Problem

Large language models routinely get arithmetic wrong, miscount items in a list, and round numbers inconsistently when they try to compute the answer in their head. A small numeric error early in a workflow invalidates every downstream step, and the model offers no audit trail for how it arrived at a wrong number. Asking the model to be more careful does not fix the underlying issue: the computation never becomes a step the model can rerun or inspect.

Solution

The agent emits a code block; a controlled interpreter (Python sandbox, JS VM, container) runs it; stdout/stderr/return value flow back. Repeat under a step budget. CodeAct treats code as the action language directly.

When to use

  • The task involves calculation, parsing, or transformations that LLMs hallucinate.
  • A controlled interpreter or sandbox is available and trusted enough to run model-emitted code.
  • stdout, stderr, and return values can flow back to the agent under a step budget.

Open the full interactive page

Diagram, neighbourhood map, code examples, related patterns and full provenance.

Related