Tool Use & Environment

Computer Use

Let the model drive a desktop end-to-end via screenshots plus virtual mouse/keyboard tool calls instead of bespoke per-app APIs.

Problem

Building a bespoke integration for every target application takes weeks per app and has to be redone the moment the vendor changes a screen. Most enterprise software has no API at all, or only an API that covers a fraction of what users actually do in the UI. Without a way to drive the screen visually, the agent simply cannot reach those applications, and per-app integration work scales linearly with the surface area the agent is expected to cover.

Solution

The model receives screenshots (optionally augmented with accessibility-tree or set-of-mark annotations) and emits typed tool calls (move mouse, click, type, scroll, screenshot). A controller executes them against a real or virtual desktop. The loop is ReAct-shaped: screenshot → think → act → screenshot.
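The loop above can be sketched in a few lines. This is a minimal illustration, not a vendor API: the action classes, `FakeDesktop`, `scripted_policy`, and `run_loop` are all hypothetical names, the "screenshot" is a plain dict rather than image bytes, and the policy function stands in for the model call.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


# Typed tool calls the model may emit (illustrative names, not a vendor schema).
@dataclass(frozen=True)
class Click:
    x: int
    y: int


@dataclass(frozen=True)
class TypeText:
    text: str


@dataclass(frozen=True)
class Scroll:
    dy: int


@dataclass(frozen=True)
class Done:
    reason: str


Action = Union[Click, TypeText, Scroll, Done]


@dataclass
class FakeDesktop:
    """Stand-in for a real or virtual desktop; records what was executed."""
    focused: bool = False
    typed: str = ""
    log: List[str] = field(default_factory=list)

    def screenshot(self) -> dict:
        # A real controller would return image bytes; a dict keeps the sketch runnable.
        return {"focused": self.focused, "typed": self.typed}

    def execute(self, action: Action) -> None:
        if isinstance(action, Click):
            self.focused = True
            self.log.append(f"click({action.x},{action.y})")
        elif isinstance(action, TypeText):
            if self.focused:
                self.typed += action.text
            self.log.append(f"type({action.text!r})")
        elif isinstance(action, Scroll):
            self.log.append(f"scroll({action.dy})")


def scripted_policy(observation: dict) -> Action:
    """Placeholder for the model call: picks the next action from the observation."""
    if not observation["focused"]:
        return Click(x=100, y=200)
    if observation["typed"] != "hello":
        return TypeText("hello")
    return Done("text entered")


def run_loop(desktop, policy, max_steps: int = 10) -> Optional[Done]:
    """screenshot -> think -> act -> screenshot, bounded by max_steps."""
    for _ in range(max_steps):
        observation = desktop.screenshot()  # observe
        action = policy(observation)        # think (a model call in practice)
        if isinstance(action, Done):
            return action
        desktop.execute(action)             # act
    return None  # step budget exhausted without the policy signalling completion
```

Calling `run_loop(FakeDesktop(), scripted_policy)` clicks to focus, types the text, and then stops when the policy returns `Done`. The `max_steps` bound is a design choice added for the sketch so that a policy that never finishes cannot loop forever.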

When to use

  • The target software has no clean API and the agent must drive a real desktop visually.
  • The target environment can be observed through screenshots and controlled through virtual mouse and keyboard input.
  • The model vendor offers a model whose screen grounding is good enough to locate and act on UI elements reliably.

