III · Tool Use & Environment · Emerging

Dual-System GUI Agent

also known as Decision-Plus-Grounding, Planner-and-Vision Split, Two-Model GUI Agent

Split a GUI agent into a decision model that plans and recovers from errors and a grounding model that observes pixels and emits the precise action; route each subproblem to the better-suited model.

This pattern helps complete certain larger patterns —

  • specialises Computer Use: Let the model drive a desktop end-to-end via screenshots plus virtual mouse/keyboard tool calls instead of bespoke per-app APIs.
  • specialises Browser Agent: Expose websites to the agent through a structured DOM/accessibility tree plus a small action vocabulary, sitting between raw HTML and pixel-level Computer Use.

Context

A team is operating a long, multi-step GUI workflow with an agent: a web flow that involves filling forms across half a dozen pages, or a phone app sequence that books a ride, applies a coupon, and confirms payment. The task needs flexible high-level planning (when to back out, when to retry, what to do if the form looks different than expected) and at the same time precise pixel-accurate grounding of each click.

Problem

When one model does both planning and pixel grounding, its reliability is capped by whichever of the two skills it is weaker at for the current step. A model strong at planning picks the right menu item but misses it by a few pixels; a model strong at vision keeps trying to recover from a bad click locally instead of stepping back and replanning. Failures cannot be attributed cleanly either, since the same model is responsible both for deciding what to do and for executing it.

Forces

  • Planning skill and grounding skill are distinct in current models.
  • Two models mean more calls per turn, but each can be smaller because it handles a narrower task.
  • Hand-off between models needs a clean intermediate representation.
  • Error recovery has to know which model to blame.

Example

A desktop-automation agent occasionally misses the intended menu item by a few pixels; on other tasks it plans well but loops endlessly trying to recover from a bad click. A single model is limited by whichever skill is harder for it at that moment. The team splits it into a Dual-System GUI Agent: a strong planning model decides what to do and how to recover from errors, and a separate vision-grounding model translates 'click Save As' into the precise pixel coordinates. Each subproblem goes to the better-suited model.

Solution

Therefore:

Define a clean intermediate representation: the decision model emits a high-level intent ("open the cart", "swipe left to next item") in a small, typed vocabulary; the grounding model receives that intent plus the current screenshot and emits the concrete action (tap(x,y), swipe coordinates, key press). The decision model holds the plan and replans on failure; the grounding model is stateless per action but specialised on screen interpretation. Errors at the grounding step are reported back to the decision model for replanning, not retried locally.
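
The hand-off described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names `IntentKind`, `Intent`, `Tap`, `GroundingFailure`, `DecisionModel`, `ground`, `execute`, and `run` are all hypothetical, the "screenshot" is a plain dictionary standing in for pixels, and both models are toy stubs where a real system would call a planner model and a vision model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple, Union


class IntentKind(Enum):
    # Hypothetical vocabulary; a real one would be tuned to the target apps.
    OPEN = "open"
    DISMISS = "dismiss"


@dataclass(frozen=True)
class Intent:
    kind: IntentKind
    target: str  # element named in words, never in pixels


@dataclass(frozen=True)
class Tap:
    x: int
    y: int


@dataclass(frozen=True)
class GroundingFailure:
    intent: Intent
    reason: str


# Stand-in for a screenshot: visible element label -> pixel position.
Screen = Dict[str, Tuple[int, int]]


def ground(intent: Intent, screen: Screen) -> Union[Tap, GroundingFailure]:
    """Stateless grounding stub: the result depends only on the arguments."""
    pos = screen.get(intent.target)
    if pos is None:
        return GroundingFailure(intent, f"{intent.target!r} not visible")
    return Tap(*pos)


class DecisionModel:
    """Holds the plan and replans on failure; a toy stand-in for a planner."""

    def __init__(self, plan: List[Intent]) -> None:
        self.plan = list(plan)
        self.replans = 0

    def next_intent(self) -> Optional[Intent]:
        return self.plan[0] if self.plan else None

    def on_success(self) -> None:
        self.plan.pop(0)

    def on_failure(self, failure: GroundingFailure) -> None:
        # Toy replanning rule: assume a dialog is in the way and dismiss it.
        self.replans += 1
        self.plan.insert(0, Intent(IntentKind.DISMISS, "close dialog"))


def execute(action: Tap, screen: Screen) -> Screen:
    # Fake environment: tapping the dialog's close button reveals the cart.
    if screen.get("close dialog") == (action.x, action.y):
        return {"cart": (300, 40)}
    return screen


def run(decider: DecisionModel, screen: Screen, max_steps: int = 10) -> List[Tap]:
    actions: List[Tap] = []
    for _ in range(max_steps):
        intent = decider.next_intent()
        if intent is None:
            break
        result = ground(intent, screen)
        if isinstance(result, GroundingFailure):
            # Report back for replanning; the grounder never retries locally.
            decider.on_failure(result)
            continue
        actions.append(result)
        decider.on_success()
        screen = execute(result, screen)
    return actions
```

Starting from a screen that shows only a blocking dialog, `run(DecisionModel([Intent(IntentKind.OPEN, "cart")]), {"close dialog": (10, 10)})` first gets a `GroundingFailure`, lets the decision stub replan once, and then produces two taps. The `max_steps` bound is a design choice of this sketch so that a plan that never grounds cannot loop forever.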

What this pattern forbids. The decision model may not emit pixel-level actions; the grounding model may not change the plan or invent intents outside the typed vocabulary.
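
One way to make these prohibitions enforceable rather than advisory is to validate each model's output at the boundary. The sketch below assumes both models emit JSON and uses only the Python standard library; the field names, the vocabulary sets, and the `ContractViolation`, `parse_intent`, and `parse_action` names are hypothetical choices for illustration, not part of the pattern itself.

```python
import json

# Hypothetical vocabularies; a real deployment would define its own.
ALLOWED_INTENT_KINDS = {"open", "dismiss", "swipe_left"}
INTENT_FIELDS = {"kind", "target"}
ALLOWED_ACTION_TYPES = {"tap"}
ACTION_FIELDS = {"type", "x", "y"}


class ContractViolation(ValueError):
    """Raised when a model steps outside its side of the split."""


def _load_object(raw: str, allowed_fields: set) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractViolation(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractViolation("expected a JSON object")
    extra = set(data) - allowed_fields
    if extra:
        raise ContractViolation(f"unexpected fields: {sorted(extra)}")
    return data


def parse_intent(raw: str) -> dict:
    """Decision-model output: an intent in words, with no pixel fields."""
    data = _load_object(raw, INTENT_FIELDS)  # rejects smuggled x / y
    kind = data.get("kind")
    if not isinstance(kind, str) or kind not in ALLOWED_INTENT_KINDS:
        raise ContractViolation(f"unknown intent kind: {kind!r}")
    target = data.get("target")
    if not isinstance(target, str) or not target:
        raise ContractViolation("target must be a non-empty string")
    return data


def parse_action(raw: str) -> dict:
    """Grounding-model output: a concrete action, with no plan or intent."""
    data = _load_object(raw, ACTION_FIELDS)  # rejects 'intent', 'plan', ...
    if data.get("type") not in ALLOWED_ACTION_TYPES:
        raise ContractViolation(f"unknown action type: {data.get('type')!r}")
    for axis in ("x", "y"):
        value = data.get(axis)
        if not isinstance(value, int) or isinstance(value, bool):
            raise ContractViolation(f"{axis} must be an integer")
    return data
```

With these checks, a decision-model message such as `{"kind": "open", "target": "cart", "x": 12, "y": 40}` is rejected for carrying coordinates, and a grounding-model message that adds an `intent` or `plan` field is rejected for trying to steer the plan.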

The smaller patterns that complete this one —

  • uses Multi-Model Routing ★★: Send each request to the cheapest model that can handle it well.
  • uses Structured Output ★★: Constrain the model's output to conform to a JSON Schema (or similar typed shape).
  • generalises Policy-Localizer-Validator: Split a GUI agent into three specialist models (a Policy that plans, a Localizer that grounds elements to pixels, and a Validator that judges completion) so each role uses the smallest sufficient model.

And the patterns that stand alongside it, or against it —

  • complements Mobile UI Agent: Drive a smartphone end-to-end through a small, touch-native action vocabulary (tap, long-press, swipe, type, back, home) over screenshots, as a distinct interaction surface from desktop Computer Use and from web Browser Agents.
  • alternative-to Talker-Reasoner: Split an interactive agent into a fast Talker for conversational responses and a slow Reasoner for deliberative planning and tool use, so the conversational loop never blocks on reasoning.
