Computer Use

also known as Desktop Agent, GUI Agent, Screen Control

Let the model drive a desktop end-to-end via screenshots plus virtual mouse/keyboard tool calls instead of bespoke per-app APIs.

Context

A team needs an agent to drive a desktop application or chain together work across several apps that have no public API and no plug-in integration: a legacy accounting suite, an internal CRM, a remote desktop, a custom Windows utility. The agent has to operate exactly the same screen, mouse, and keyboard a human would.

Problem

Building a bespoke integration for every target application takes weeks per app and has to be redone the moment the vendor changes a screen. Most enterprise software has no API at all, or only an API that covers a fraction of what users actually do in the UI. Without a way to drive the screen visually, the agent simply cannot reach those applications, and per-app integration work scales linearly with the surface area the agent is expected to cover.

Forces

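Latency and reliability: every action costs a screenshot round trip through the model, and a misread screen or misplaced click can derail a multi-step task. Both remain open problems.
Prompt injection: any text rendered on screen (a web page, an email, a document) reaches the model as input, so on-screen content is a real attack surface.
Cost: every step sends at least one screenshot, so every step pays vision tokens.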
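
The cost force can be made concrete with a back-of-the-envelope estimate. The sketch below is only an illustration under stated assumptions: the function names, the idea that text history is re-sent in full on every step, and every number in the usage example are hypothetical placeholders, not measurements or vendor prices.

```python
def estimate_input_tokens(steps: int, image_tokens: int,
                          text_tokens_per_step: int,
                          screenshots_kept: int = 1) -> int:
    """Rough input-token total for one episode.

    Assumptions (not from the pattern text): each prompt carries the most
    recent `screenshots_kept` screenshots, and the text history grows by
    `text_tokens_per_step` per step and is re-sent in full every step.
    """
    if steps < 0 or screenshots_kept < 1:
        raise ValueError("steps must be >= 0 and screenshots_kept >= 1")
    total = 0
    for step in range(1, steps + 1):
        images_in_prompt = min(step, screenshots_kept)
        total += images_in_prompt * image_tokens + step * text_tokens_per_step
    return total


def estimate_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count to dollars at a given (placeholder) price."""
    return tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    # Placeholder figures only; substitute your model's real token counts and price.
    tokens = estimate_input_tokens(steps=30, image_tokens=1_000,
                                   text_tokens_per_step=150, screenshots_kept=3)
    print(tokens, estimate_cost_usd(tokens, usd_per_million_tokens=3.0))
```

Under these assumptions, keeping more screenshots in context multiplies the image share of every prompt, which is why trimming old frames is a common first lever on cost.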

Example

A solo founder wants their agent to update a spreadsheet in a desktop accounting app that has no API and no plug-ins. Building a bespoke integration would take weeks and they'd need to do it again for the next tool. They put the agent on Computer Use: it receives screenshots of the desktop and emits virtual mouse and keyboard actions to navigate menus, click cells, and type. Clunkier and slower than an API, but it works on the software the founder actually owns.

Diagram

sequenceDiagram
    participant Model
    participant Ctrl as Controller
    participant Desktop
    loop until done
        Desktop-->>Ctrl: screenshot
        Ctrl-->>Model: image (+ a11y tree)
        Model->>Ctrl: click / type / scroll
        Ctrl->>Desktop: virtual mouse / keyboard
    end

Solution

Therefore:

The model receives screenshots (optionally augmented with accessibility-tree or set-of-mark annotations) and emits typed tool calls (move mouse, click, type, scroll, screenshot). A controller executes them against a real or virtual desktop. The loop is ReAct-shaped: screenshot → think → act → screenshot.
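
A minimal sketch of that loop, assuming a controller written in Python. Every name here (`Click`, `TypeText`, `Scroll`, `Done`, `Desktop`, `FakeDesktop`, `run_episode`, `scripted_policy`) is hypothetical and chosen for illustration; the `Done` action and the `max_steps` cap are assumptions added so the loop terminates, and the policy stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Union


@dataclass(frozen=True)
class Click:
    x: int
    y: int


@dataclass(frozen=True)
class TypeText:
    text: str


@dataclass(frozen=True)
class Scroll:
    dy: int


@dataclass(frozen=True)
class Done:
    summary: str


Action = Union[Click, TypeText, Scroll, Done]


class Desktop(Protocol):
    def screenshot(self) -> bytes: ...
    def apply(self, action: Action) -> None: ...


# Hypothetical policy signature: (goal, latest screenshot, history) -> next action.
Policy = Callable[[str, bytes, List[Action]], Action]


def run_episode(goal: str, desktop: Desktop, policy: Policy,
                max_steps: int = 20) -> List[Action]:
    """Screenshot -> think -> act, repeated until Done or the step cap."""
    history: List[Action] = []
    for _ in range(max_steps):
        frame = desktop.screenshot()            # observe
        action = policy(goal, frame, history)   # think (the model call)
        history.append(action)
        if isinstance(action, Done):
            break
        desktop.apply(action)                   # act
    return history


class FakeDesktop:
    """In-memory stand-in for a real or virtual desktop, for dry runs."""

    def __init__(self) -> None:
        self.applied: List[Action] = []
        self.frames = 0

    def screenshot(self) -> bytes:
        self.frames += 1
        return f"frame-{self.frames}".encode()

    def apply(self, action: Action) -> None:
        self.applied.append(action)


def scripted_policy(goal: str, frame: bytes, history: List[Action]) -> Action:
    """Replays a fixed script in place of a model, one action per step."""
    script: List[Action] = [Click(120, 48), TypeText("42.00"), Done("cell updated")]
    return script[len(history)]


if __name__ == "__main__":
    desk = FakeDesktop()
    print(run_episode("update the spreadsheet cell", desk, scripted_policy))
```

Keeping the policy behind a plain callable lets the same loop run against a scripted stub in tests and a model-backed policy in production.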

What this pattern forbids. The agent operates the desktop only through the typed action vocabulary; arbitrary code execution is not part of this surface.
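
One way a controller might enforce this restriction is to validate every tool call against a closed vocabulary before executing it. The sketch below is an assumption-laden illustration: the dictionary shape (`"action"` plus `"args"`), the argument names, and the `ForbiddenAction` exception are hypothetical, and only the five action names come from the Solution text.

```python
# Closed action vocabulary: action name -> required argument names and types.
# The argument names (x, y, text, dy) are illustrative assumptions.
ALLOWED_ACTIONS = {
    "move_mouse": {"x": int, "y": int},
    "click": {"x": int, "y": int},
    "type": {"text": str},
    "scroll": {"dy": int},
    "screenshot": {},
}


class ForbiddenAction(ValueError):
    """Raised when a tool call falls outside the typed action vocabulary."""


def validate_tool_call(call: dict) -> dict:
    """Return a normalized copy of `call`, or raise ForbiddenAction."""
    name = call.get("action")
    if name not in ALLOWED_ACTIONS:
        raise ForbiddenAction(f"action {name!r} is outside the typed vocabulary")
    schema = ALLOWED_ACTIONS[name]
    args = call.get("args", {})
    if set(args) != set(schema):
        raise ForbiddenAction(
            f"{name} expects arguments {sorted(schema)}, got {sorted(args)}"
        )
    for key, expected in schema.items():
        # Exact type check so that, for example, True is not accepted as an int.
        if type(args[key]) is not expected:
            raise ForbiddenAction(f"{name}.{key} must be {expected.__name__}")
    return {"action": name, "args": dict(args)}


if __name__ == "__main__":
    print(validate_tool_call({"action": "click", "args": {"x": 10, "y": 20}}))
    try:
        validate_tool_call({"action": "run_shell", "args": {"cmd": "whoami"}})
    except ForbiddenAction as err:
        print("rejected:", err)
```

Because the vocabulary is an allow-list, a request such as `run_shell` is rejected by default without needing to be anticipated.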

The smaller patterns that complete this one:

uses: ReAct ★★ - Interleave a single thought, a single tool call, and a single observation per step so the agent reasons over fresh evidence.
generalises: Dual-System GUI Agent ★ - Split a GUI agent into a decision model that plans and recovers from errors and a grounding model that observes pixels and emits the precise action; route each subproblem to the better-suited model.
generalises: Policy-Localizer-Validator ★ - Split a GUI agent into three specialist models (a Policy that plans, a Localizer that grounds elements to pixels, and a Validator that judges completion) so each role uses the smallest sufficient model.
generalises: Full-Desktop Computer Use ★ - Give the agent a complete containerized OS desktop with native apps, a persistent filesystem, and desktop credential stores, so it can finish multi-application workflows a browser-only surface cannot.

And the patterns that stand alongside it, or against it:

alternative-to: Browser Agent ★ - Expose websites to the agent through a structured DOM/accessibility tree plus a small action vocabulary, sitting between raw HTML and pixel-level Computer Use.
complements: Input/Output Guardrails ★★ - Validate inputs before they reach the model and outputs before they reach the user.
alternative-to: Mobile UI Agent ★ - Drive a smartphone end-to-end through a small, touch-native action vocabulary (tap, long-press, swipe, type, back, home) over screenshots, as a distinct interaction surface from desktop Computer Use and from web Browser Agents.
alternative-to: Multilingual Voice Agent Stack ★ - Compose a voice agent as a tightly co-located pipeline of speech-to-text, language-aware LLM reasoning, and text-to-speech, where one vendor owns all three so language and dialect propagate cleanly across stages.
complements: Proactive Goal Creator ★ - Anticipate the user's goal by capturing surrounding multimodal context (gestures, screen state, environment) in addition to what the user types or says.
complements: Large Action Models (LAMs) · - Use a model class specifically trained for action execution (tool calls, UI navigation, workflow steps) rather than text generation, when the workload is dominated by reliably completing actions in real systems.
complements: Magentic-One Generalist Multi-Agent ★ - Use Microsoft's generalist multi-agent architecture: a single Orchestrator agent dispatches to four specialist sub-agents (WebSurfer, FileSurfer, Coder, ComputerTerminal) to solve open-ended complex tasks that span web browsing, file manipulation, code execution, and shell operations.

Used in recipes

Browser & Computer-Use Stack
core

References

Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku (Anthropic blog)

Provenance

Source: patterns/computer-use.md on GitHub · commit 4fa1213
Added to catalog: 2026-04-30
Last updated: 2026-05-21
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.