III · Tool Use & Environment · Emerging

Computer Use

also known as Desktop Agent, GUI Agent, Screen Control

Let the model drive a desktop end-to-end via screenshots plus virtual mouse/keyboard tool calls instead of bespoke per-app APIs.

Context

A team needs an agent to drive a desktop application or chain together work across several apps that have no public API and no plug-in integration: a legacy accounting suite, an internal CRM, a remote desktop, a custom Windows utility. The agent has to operate exactly the same screen, mouse, and keyboard a human would.

Problem

Building a bespoke integration for every target application takes weeks per app and has to be redone the moment the vendor changes a screen. Most enterprise software has no API at all, or only an API that covers a fraction of what users actually do in the UI. Without a way to drive the screen visually, the agent simply cannot reach those applications, and per-app integration work scales linearly with the surface area the agent is expected to cover.

Forces

  • Latency and reliability are open problems.
  • Prompt injection via on-screen content is a real attack surface.
  • Cost: every step pays vision tokens.
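The cost force can be made concrete with a back-of-the-envelope estimate. The sketch below is illustrative only: the function name, its parameters, and any numbers passed to it are assumptions, not figures from a particular model or vendor. It also deliberately ignores the resending of earlier screenshots as conversation history, which would raise the total.

```python
def estimate_run_cost(
    steps: int,
    vision_tokens_per_screenshot: int,
    text_tokens_per_step: int,
    price_per_million_tokens: float,
) -> float:
    """Rough cost of one agent run, in the same currency as the price.

    Assumes exactly one screenshot per step and a flat per-token price.
    All four inputs are hypothetical knobs supplied by the caller.
    """
    if steps < 0:
        raise ValueError("steps must be non-negative")
    tokens_per_step = vision_tokens_per_screenshot + text_tokens_per_step
    total_tokens = steps * tokens_per_step
    return total_tokens * price_per_million_tokens / 1_000_000
```

Plugging in a team's own measured token counts and pricing shows how quickly a long task adds up, since the screenshot term is paid again on every step.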

Example

A solo founder wants their agent to update a spreadsheet in a desktop accounting app that has no API and no plug-ins. Building a bespoke integration would take weeks, and the work would have to be repeated for the next tool. They put the agent on Computer Use: it receives screenshots of the desktop and emits virtual mouse and keyboard actions to navigate menus, click cells, and type. It is clunkier and slower than an API, but it works on the software the founder actually owns.

Solution

Therefore:

The model receives screenshots (optionally augmented with accessibility-tree or set-of-mark annotations) and emits typed tool calls (move mouse, click, type, scroll, screenshot). A controller executes them against a real or virtual desktop. The loop is ReAct-shaped: screenshot → think → act → screenshot.
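The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `Action`, `FakeDesktop`, `scripted_policy`, and `run_loop` are hypothetical names, the desktop is an in-memory stand-in rather than a real screen, and the "think" step is hidden inside whatever callable plays the role of the model.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Action:
    """One typed tool call; `kind` is one of the vocabulary names below."""
    kind: str  # "move" | "click" | "type" | "scroll" | "screenshot" | "done"
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class FakeDesktop:
    """Hypothetical stand-in for a real or virtual desktop."""
    log: List[Action] = field(default_factory=list)

    def screenshot(self) -> bytes:
        # A real controller would capture pixels; here the frame just
        # encodes how many actions have been executed so far.
        return f"frame-{len(self.log)}".encode()

    def execute(self, action: Action) -> None:
        self.log.append(action)


Policy = Callable[[bytes], Action]


def scripted_policy(script: Iterable[Action]) -> Policy:
    """Stand-in for the model: replays a fixed script, then signals done."""
    remaining = iter(script)

    def policy(frame: bytes) -> Action:
        return next(remaining, Action("done"))

    return policy


def run_loop(policy: Policy, desktop: FakeDesktop, max_steps: int = 20) -> int:
    """screenshot -> think -> act, repeated; returns the number of steps used."""
    for step in range(1, max_steps + 1):
        frame = desktop.screenshot()   # observe fresh evidence
        action = policy(frame)         # think and choose one typed action
        if action.kind == "done":
            return step
        desktop.execute(action)        # act; the next iteration re-observes
    raise RuntimeError("step budget exhausted before the task finished")
```

The `max_steps` budget is a design choice of this sketch: because every iteration costs a screenshot and a model call, a hard cap keeps a confused agent from looping indefinitely.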

What this pattern forbids. The agent operates the desktop only through the typed action vocabulary; arbitrary code execution is not part of this surface.
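One way a controller might enforce this restriction is to validate every tool call against a closed vocabulary before executing it. The sketch below assumes a hypothetical call shape (`{"name": ..., "args": {...}}`) and hypothetical argument names; real tool schemas will differ.

```python
# Closed action vocabulary: action name -> exact set of required argument names.
# The names and argument sets are illustrative assumptions.
ALLOWED_ACTIONS = {
    "move": {"x", "y"},
    "click": {"x", "y"},
    "type": {"text"},
    "scroll": {"dx", "dy"},
    "screenshot": set(),
}


class ForbiddenAction(ValueError):
    """Raised when a tool call falls outside the typed action vocabulary."""


def validate_tool_call(call: dict) -> dict:
    """Return a normalised copy of the call, or raise ForbiddenAction."""
    name = call.get("name")
    args = call.get("args", {})
    if name not in ALLOWED_ACTIONS:
        # Anything else (for example a shell or code-execution tool) is rejected.
        raise ForbiddenAction(f"{name!r} is not in the action vocabulary")
    if not isinstance(args, dict):
        raise ForbiddenAction(f"arguments for {name!r} must be a mapping")
    expected = ALLOWED_ACTIONS[name]
    if set(args) != expected:
        raise ForbiddenAction(
            f"{name!r} expects arguments {sorted(expected)}, got {sorted(args)}"
        )
    return {"name": name, "args": dict(args)}
```

Rejecting unknown names outright, rather than passing them through, keeps the surface limited to mouse, keyboard, and screenshot actions even if the model emits something else.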

The smaller patterns that complete this one —

  • uses ReAct ★★: Interleave a single thought, a single tool call, and a single observation per step so the agent reasons over fresh evidence.
  • generalises Dual-System GUI Agent: Split a GUI agent into a decision model that plans and recovers from errors and a grounding model that observes pixels and emits the precise action; route each subproblem to the better-suited model.
  • generalises Policy-Localizer-Validator: Split a GUI agent into three specialist models (a Policy that plans, a Localizer that grounds elements to pixels, and a Validator that judges completion) so each role uses the smallest sufficient model.

And the patterns that stand alongside it, or against it —

  • alternative-to Browser Agent: Expose websites to the agent through a structured DOM/accessibility tree plus a small action vocabulary, sitting between raw HTML and pixel-level Computer Use.
  • complements Input/Output Guardrails ★★: Validate inputs before they reach the model and outputs before they reach the user.
  • alternative-to Mobile UI Agent: Drive a smartphone end-to-end through a small, touch-native action vocabulary (tap, long-press, swipe, type, back, home) over screenshots, as a distinct interaction surface from desktop Computer Use and from web Browser Agents.
  • alternative-to Multilingual Voice Agent Stack: Compose a voice agent as a tightly co-located pipeline of speech-to-text, language-aware LLM reasoning, and text-to-speech, where one vendor owns all three so language and dialect propagate cleanly across stages.
  • complements Proactive Goal Creator: Anticipate the user's goal by capturing surrounding multimodal context (gestures, screen state, environment) in addition to what the user types or says.
  • complements Large Action Models (LAMs): Use a model class specifically trained for action execution (tool calls, UI navigation, workflow steps) rather than text generation, when the workload is dominated by reliably completing actions in real systems.
  • complements Magentic-One Generalist Multi-Agent: Use Microsoft's generalist multi-agent architecture, in which a single Orchestrator agent dispatches to four specialist sub-agents (WebSurfer, FileSurfer, Coder, ComputerTerminal), for open-ended complex tasks that span web browsing, file manipulation, code execution, and shell operations.

