III · Tool Use & Environment · Mature ★★

Code Execution

also known as Code-Then-Execute, CodeAct, Program of Thoughts

Let the model emit code, run it in a sandbox, and treat the run as the answer instead of trusting the model to compute in its head.

This pattern helps complete certain larger patterns —

  • specialises Tool Use ★★: Let the LLM produce typed calls against an external toolkit instead of producing free-form text the surrounding system has to parse.
  • used-by Code-as-Action Agent: Have the agent emit a code snippet as its action each step, executed in a constrained interpreter, instead of emitting JSON tool calls; tool composition becomes function nesting and control flow inside the snippet.

Context

A team is building an agent for a task that involves arithmetic, data manipulation, parsing, or other deterministic computation. The deployment can host a sandboxed Python or JavaScript interpreter (or another container-based execution environment) that the agent's code blocks can run inside.

Problem

Large language models routinely get arithmetic wrong, miscount items in a list, and round numbers inconsistently when they try to compute the answer in their head. A small numeric error early in a workflow invalidates every downstream step, and the model offers no audit trail for how it arrived at a wrong number. Asking the model to be more careful does not fix the underlying issue: the computation never becomes a step the model can rerun or inspect.

Forces

  • Sandbox setup adds latency.
  • Generated code may import unsafe modules or run forever.
  • Execution results must round-trip back into the model's working context.
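
A minimal sketch of how the second and third forces might be handled, assuming a plain child Python process stands in for the sandbox. The names `ALLOWED_MODULES`, `disallowed_imports`, and `run_with_timeout` are hypothetical, and the allowlist contents are illustrative only. The static import check is a convenience filter, not a security boundary: code can still reach modules through `__import__` or similar, so real isolation has to come from the sandbox itself.

```python
import ast
import subprocess
import sys

# Hypothetical allowlist; a real deployment would choose its own.
ALLOWED_MODULES = {"math", "statistics", "json", "csv", "decimal"}


def disallowed_imports(source: str) -> set:
    """Return top-level module names imported by `source` that are not allowlisted."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found - ALLOWED_MODULES


def run_with_timeout(source: str, timeout_s: float = 5.0) -> dict:
    """Run `source` in a child interpreter and return a result the model can read."""
    bad = disallowed_imports(source)
    if bad:
        return {"ok": False, "stdout": "", "stderr": f"disallowed imports: {sorted(bad)}"}
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", source],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if it runs too long
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout_s}s"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}
```

The returned dictionary is what round-trips into the model's context: a rejected import, a timeout, and a traceback all come back as text the model can react to on its next step.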

Example

A finance agent answers 'what was the average gross margin across these 47 orders?' by reading the rows and trying to compute the answer in its head, getting it wrong by 1.4 percentage points. The team enables Code Execution: the agent emits a short Python snippet that loads the data and computes the average in a sandbox, and the run's stdout becomes the answer. The model's strength stays at constructing the right calculation; the arithmetic stops being something it has to hallucinate.
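
The kind of snippet the agent might emit could look like the sketch below. The rows are hypothetical (three orders rather than 47, with made-up revenue and cost figures), and the column layout is an assumption. The snippet also shows something the in-head answer hides: "average gross margin" can mean the mean of per-order margins or the margin on the totals, and the code makes that choice visible.

```python
from statistics import mean

# Hypothetical rows: (order_id, revenue, cost_of_goods_sold).
orders = [
    ("A-1", 200.0, 150.0),
    ("A-2", 100.0, 60.0),
    ("A-3", 400.0, 300.0),
]

# Interpretation 1: average the margin of each order.
per_order_margins = [(revenue - cost) / revenue for _id, revenue, cost in orders]
average_margin = mean(per_order_margins)

# Interpretation 2: margin on total revenue and total cost.
total_revenue = sum(revenue for _id, revenue, _cost in orders)
total_cost = sum(cost for _id, _revenue, cost in orders)
aggregate_margin = (total_revenue - total_cost) / total_revenue

print(f"mean of per-order margins: {average_margin:.1%}")  # 30.0%
print(f"aggregate margin: {aggregate_margin:.1%}")         # 27.1%
```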

Solution

Therefore:

The agent emits a code block; a controlled interpreter (Python sandbox, JS VM, or container) runs it; and stdout, stderr, and the return value flow back into the model's working context as the next observation. The loop repeats under a step budget. CodeAct treats code as the action language directly.
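
The loop can be sketched as follows, under stated assumptions: a child Python process stands in for the sandbox (it provides no real isolation), and `scripted_model` is a hypothetical stand-in for the LLM that returns the next code block, or `None` when it is done. The names `execute` and `run_agent` are illustrative, not taken from any framework.

```python
import subprocess
import sys
from typing import Callable, List, Optional


def execute(code: str, timeout_s: float = 5.0) -> str:
    """Run one code block in a child interpreter and return the observation text."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return f"ERROR: timed out after {timeout_s}s"
    if proc.returncode != 0:
        return "ERROR:\n" + proc.stderr
    return proc.stdout


def run_agent(next_code: Callable[[List[str]], Optional[str]], max_steps: int = 5) -> List[str]:
    """Alternate between the model emitting code and the interpreter running it."""
    observations: List[str] = []
    for _ in range(max_steps):  # the step budget
        code = next_code(observations)
        if code is None:  # the model decided it is finished
            break
        observations.append(execute(code))
    return observations


def scripted_model(observations: List[str]) -> Optional[str]:
    """Stand-in for an LLM: the first attempt has a bug, the second fixes it."""
    if not observations:
        return "print(sum([1, 2, 3]) / cnt)"  # NameError: cnt is undefined
    if observations[-1].startswith("ERROR"):
        return "values = [1, 2, 3]\nprint(sum(values) / len(values))"
    return None


if __name__ == "__main__":
    for step, observation in enumerate(run_agent(scripted_model), start=1):
        print(f"step {step}: {observation.strip()}")
```

The first observation carries the traceback back to the model, which is what lets the second attempt correct the mistake; the step budget bounds how many such retries are allowed.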

What this pattern forbids. Computation happens in the sandbox; the model's free-form numeric output is not trusted.
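
One way a harness might enforce this rule, offered as an assumption rather than as part of the pattern: check that every number in the final answer also appears in the sandbox output. The function name `untraced_numbers` is hypothetical, and the check is deliberately crude. It compares digit strings literally, so `30` and `30.0` do not match, and numbers quoted from the question would need separate handling.

```python
import re

# Matches unsigned integers and decimals such as "47" or "30.0".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def untraced_numbers(answer: str, sandbox_stdout: str) -> list:
    """Return numbers in `answer` that never appeared in the sandbox output."""
    produced = set(NUMBER.findall(sandbox_stdout))
    return [n for n in NUMBER.findall(answer) if n not in produced]


if __name__ == "__main__":
    print(untraced_numbers("Average gross margin: 30.0%", "30.0\n"))  # []
    print(untraced_numbers("Average gross margin: 31.4%", "30.0\n"))  # ['31.4']
```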

And the patterns that stand alongside it, or against it —

  • composes-with ReAct ★★: Interleave a single thought, a single tool call, and a single observation per step so the agent reasons over fresh evidence.
  • composes-with Deterministic-LLM Sandwich: Bracket every LLM call with deterministic checks on both sides.
  • composes-with Skill Library: Let the agent grow its own toolkit by writing reusable skills that subsequent runs can call.
  • complements Sandbox Isolation ★★: Run agent-emitted code or actions in a contained environment with restricted filesystem, network, and process privileges.
  • complements WebAssembly Skill Runtime ·: Package each agent skill as a WebAssembly module with a capability manifest, and run it inside a Wasm runtime that enforces those capabilities, so untrusted skills cannot weaken the host's sandbox.
  • complements Code-Then-Execute with Dataflow Analysis: Have the agent emit code in a sandbox DSL whose values are statically tagged trusted/tainted via dataflow analysis before execution, enabling per-value policy enforcement.
  • complements Vibe-Coding Without Security Review (anti-pattern): A developer scaffolds an agent prototype with a code-generation tool and ships the generated code with no security review; ~90% of agent-generated code contains vulnerabilities without explicit security prompts.
  • complements Recursive Language Model ·: Treat an over-long prompt as an environment the model navigates by code, letting it partition and recursively call itself over snippets, so it answers over inputs far larger than its context window.
