Tool Use & Environment

Dual-System GUI Agent

Split a GUI agent into a decision model that plans and recovers from errors and a grounding model that observes pixels and emits the precise action; route each subproblem to the better-suited model.

Problem

When one model does both planning and pixel grounding, its performance is limited by whichever skill it is weaker at on the current step. A model strong at planning misses the intended menu item by a few pixels; a model strong at vision keeps trying to recover from a bad click locally instead of stepping back and replanning. Failures are also hard to attribute, since the same model is responsible both for deciding what to do and for executing it.

Solution

Define a clean intermediate representation: the decision model emits a high-level intent ("open the cart", "swipe left to next item") in a small, typed vocabulary; the grounding model receives that intent plus the current screenshot and emits the concrete action (tap(x,y), swipe coordinates, key press). The decision model holds the plan and replans on failure; the grounding model is stateless per action but specialised on screen interpretation. Errors at the grounding step are reported back to the decision model for replanning, not retried locally.
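
The split described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names Intent, IntentKind, ground, DecisionModel and run are hypothetical, the "screenshot" is stood in for by a dictionary mapping visible element labels to coordinates, and the replanning rule (swipe, then try again) is a toy placeholder for whatever the decision model would actually do.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional, Tuple, Union


class IntentKind(Enum):
    """Small, typed intent vocabulary (hypothetical; two kinds for brevity)."""
    OPEN = "open"
    SWIPE_LEFT = "swipe_left"


@dataclass(frozen=True)
class Intent:
    kind: IntentKind
    target: str  # e.g. "cart", "next item"


@dataclass(frozen=True)
class Tap:
    x: int
    y: int


@dataclass(frozen=True)
class Swipe:
    start: Tuple[int, int]
    end: Tuple[int, int]


Action = Union[Tap, Swipe]
# Assumption: a screenshot is modelled as {element label: (x, y)}.
Screen = Dict[str, Tuple[int, int]]


class GroundingError(Exception):
    """The intent could not be located on the current screen."""


def ground(intent: Intent, screen: Screen) -> Action:
    """Stateless grounding: intent plus current screen in, concrete action out."""
    if intent.kind is IntentKind.OPEN:
        if intent.target not in screen:
            raise GroundingError(f"{intent.target!r} not visible")
        x, y = screen[intent.target]
        return Tap(x, y)
    if intent.kind is IntentKind.SWIPE_LEFT:
        # Fixed illustrative coordinates for a right-to-left swipe.
        return Swipe(start=(300, 400), end=(50, 400))
    raise GroundingError(f"unsupported intent {intent.kind}")


class DecisionModel:
    """Holds the plan and replans when grounding reports a failure."""

    def __init__(self, plan: List[Intent]) -> None:
        self.plan = list(plan)
        self.replans = 0

    def next_intent(self) -> Optional[Intent]:
        return self.plan[0] if self.plan else None

    def on_success(self) -> None:
        self.plan.pop(0)

    def on_failure(self, intent: Intent, error: GroundingError) -> None:
        # Toy replanning rule: swipe to a new screen before retrying the intent.
        self.replans += 1
        self.plan.insert(0, Intent(IntentKind.SWIPE_LEFT, "next item"))


def run(decision: DecisionModel,
        get_screen: Callable[[], Screen],
        execute: Callable[[Action], None],
        max_steps: int = 10) -> List[Action]:
    """Control loop: grounding errors go back to the decision model, never retried here."""
    trace: List[Action] = []
    for _ in range(max_steps):
        intent = decision.next_intent()
        if intent is None:
            break
        try:
            action = ground(intent, get_screen())
        except GroundingError as err:
            decision.on_failure(intent, err)
            continue
        execute(action)
        trace.append(action)
        decision.on_success()
    return trace
```

Because ground keeps no state between calls and the loop never retries a failed grounding itself, every failure surfaces either as a GroundingError (a grounding problem) or as a plan that does not reach the goal (a decision problem), which is the clean attribution the pattern aims for.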

When to use

  • A single GUI model is strong at either planning or grounding but underperforms on the other skill.
  • A clean intermediate vocabulary ("open the cart", "swipe left to next item") can express the decision model's intents to the grounding model.
  • Two specialised models (decision and grounding) are available and routing between them is feasible.
