Dual-System GUI Agent

also known as Decision-Plus-Grounding, Planner-and-Vision Split, Two-Model GUI Agent

Split a GUI agent into a decision model that plans and recovers from errors and a grounding model that observes pixels and emits the precise action; route each subproblem to the better-suited model.

This pattern helps complete certain larger patterns —

specialises Computer Use ★: Let the model drive a desktop end-to-end via screenshots plus virtual mouse/keyboard tool calls instead of bespoke per-app APIs.
specialises Browser Agent ★: Expose websites to the agent through a structured DOM/accessibility tree plus a small action vocabulary, sitting between raw HTML and pixel-level Computer Use.

Context

A team is running a long, multi-step GUI workflow with an agent: a web flow that fills in forms across half a dozen pages, or a phone app sequence that books a ride, applies a coupon, and confirms payment. The task needs flexible high-level planning (when to back out, when to retry, what to do if the form looks different than expected) and, at the same time, precise, pixel-accurate grounding of each click.

Problem

When one model does both planning and pixel grounding, its performance is limited by whichever skill it is weaker at on the current step. A model strong at planning misses the intended menu item by a few pixels; a model strong at vision keeps trying to recover from a bad click locally instead of stepping back and replanning. Failures cannot be attributed cleanly either, since the same model is responsible both for deciding what to do and for executing it.

Forces

Planning skill and grounding skill are distinct in current models.
Two models cost more per turn, but each can be smaller because it handles a narrower task.
Hand-off between models needs a clean intermediate representation.
Error recovery has to know which model to blame.

Example

A desktop-automation agent occasionally misses the intended menu item by a few pixels, and on other tasks plans well but loops endlessly trying to recover from a bad click. A single model is limited by whichever skill is harder for it at the moment. The team splits it into a Dual-System GUI Agent: a strong planning model decides what to do and how to recover from errors, and a separate vision-grounding model translates "click Save As" into precise pixel coordinates. Each subproblem goes to the better-suited model.

Diagram

sequenceDiagram
    participant Dec as Decision Model
    participant Gnd as Grounding Model
    participant GUI
    GUI-->>Dec: state
    Dec->>Gnd: high-level intent (typed)
    Gnd->>GUI: precise pixel/element action
    GUI-->>Dec: new state
    Dec->>Dec: plan / recover from errors

Solution

Therefore:

Define a clean intermediate representation: the decision model emits a high-level intent ("open the cart", "swipe left to next item") in a small, typed vocabulary; the grounding model receives that intent plus the current screenshot and emits the concrete action (tap(x,y), swipe coordinates, key press). The decision model holds the plan and replans on failure; the grounding model is stateless per action but specialised on screen interpretation. Errors at the grounding step are reported back to the decision model for replanning, not retried locally.
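The hand-off described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the names `IntentKind`, `Intent`, `Action`, `Step`, `GroundingError`, and `run_episode` are all hypothetical, the screenshot is stood in for by a plain string, and the two models are passed in as ordinary callables. The grounding callable sees only the intent and the current screenshot (stateless per action), and a grounding failure is recorded and handed back to the decision callable rather than retried locally.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class IntentKind(Enum):
    """Hypothetical typed vocabulary the decision model is limited to."""
    OPEN = "open"
    SWIPE = "swipe"
    TYPE_TEXT = "type_text"


@dataclass(frozen=True)
class Intent:
    kind: IntentKind
    target: str      # a description such as "cart", never coordinates
    text: str = ""   # only used by TYPE_TEXT


@dataclass(frozen=True)
class Action:
    name: str        # concrete action, e.g. "tap", "swipe", "key"
    x: int = 0
    y: int = 0


@dataclass(frozen=True)
class Step:
    intent: Intent
    action: Optional[Action]  # None when grounding failed
    outcome: str


class GroundingError(Exception):
    """Raised by the grounding model when it cannot locate the target."""


def run_episode(
    decide: Callable[[str, List[Step]], Optional[Intent]],
    ground: Callable[[Intent, str], Action],
    execute: Callable[[Action], None],
    observe: Callable[[], str],
    max_steps: int = 10,
) -> List[Step]:
    """Run the decide -> ground -> execute loop until the decision model stops."""
    history: List[Step] = []
    for _ in range(max_steps):
        screenshot = observe()
        # The decision model owns the plan: it sees the full history.
        intent = decide(screenshot, history)
        if intent is None:  # the decision model considers the task finished
            break
        try:
            # The grounding model is stateless: intent plus screenshot only.
            action = ground(intent, screenshot)
        except GroundingError as err:
            # No local retry: record the failure so the next decide() can replan.
            history.append(Step(intent, None, f"grounding failed: {err}"))
            continue
        execute(action)
        history.append(Step(intent, action, "ok"))
    return history
```

Because every `Step` records whether the failure came from grounding, the history also gives a simple basis for attributing errors to one model or the other.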

What this pattern forbids. The decision model may not emit pixel-level actions; the grounding model may not change the plan or invent intents outside the typed vocabulary.
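One way to make these prohibitions enforceable is to check each model's raw output at the boundary before it is acted on. The sketch below assumes both models return JSON-like dictionaries; the vocabulary in `ALLOWED_INTENTS`, the field names in `PIXEL_FIELDS`, the mapping `ACTIONS_FOR_INTENT`, and the `ContractViolation` exception are all hypothetical choices made for illustration, not part of the pattern itself.

```python
from typing import Any, Mapping

# Hypothetical vocabulary and field names, chosen for illustration only.
ALLOWED_INTENTS = frozenset({"open", "swipe", "type_text"})
PIXEL_FIELDS = frozenset({"x", "y", "coordinates", "bbox"})
ACTIONS_FOR_INTENT = {
    "open": frozenset({"tap"}),
    "swipe": frozenset({"swipe"}),
    "type_text": frozenset({"tap", "key"}),
}


class ContractViolation(ValueError):
    """Raised when a model steps outside its role."""


def check_decision_output(message: Mapping[str, Any]) -> None:
    """Reject intents outside the vocabulary and any pixel-level fields."""
    kind = message.get("intent")
    if kind not in ALLOWED_INTENTS:
        raise ContractViolation(f"intent outside vocabulary: {kind!r}")
    leaked = PIXEL_FIELDS.intersection(message)  # iterates over the keys
    if leaked:
        raise ContractViolation(f"decision model emitted pixel fields: {sorted(leaked)}")


def check_grounding_output(intent_kind: str, action: Mapping[str, Any]) -> None:
    """Reject actions that do not realise the intent the grounder was given."""
    allowed = ACTIONS_FOR_INTENT.get(intent_kind, frozenset())
    if action.get("name") not in allowed:
        raise ContractViolation(
            f"action {action.get('name')!r} does not realise intent {intent_kind!r}"
        )
```

Raising at the boundary, rather than silently repairing the output, keeps each violation attributable to the model that produced it.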

The smaller patterns that complete this one —

uses Multi-Model Routing ★★: Send each request to the cheapest model that can handle it well.
uses Structured Output ★★: Constrain the model's output to conform to a JSON Schema (or similar typed shape).
generalises Policy-Localizer-Validator ★: Split a GUI agent into three specialist models (a Policy that plans, a Localizer that grounds elements to pixels, and a Validator that judges completion) so each role uses the smallest sufficient model.

And the patterns that stand alongside it, or against it —

complements Mobile UI Agent ★: Drive a smartphone end-to-end through a small, touch-native action vocabulary (tap, long-press, swipe, type, back, home) over screenshots, as a distinct interaction surface from desktop Computer Use and from web Browser Agents.
alternative-to Talker-Reasoner ★: Split an interactive agent into a fast Talker for conversational responses and a slow Reasoner for deliberative planning and tool use, so the conversational loop never blocks on reasoning.
alternative-to Two-Rate Cloud-Brain / Edge-Controller Split ·: Run a slow planner at low frequency that emits a compact latent plan, and a small on-device controller that tracks it at the robot's native control rate without ever blocking on the planner.

Used in recipes

Browser & Computer-Use Stack
hardening

Provenance

Source: patterns/dual-system-gui-agent.md on GitHub · commit 4fa1213
Added to catalog: 2026-04-30
Last updated: 2026-05-21
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.