Tool Use & Environment

Mobile UI Agent

Drive a smartphone end-to-end through a small, touch-native action vocabulary (tap, long-press, swipe, type, back, home) over screenshots, as a distinct interaction surface from desktop Computer Use and from web Browser Agents.

Problem

Mouse-and-keyboard action sets borrowed from desktop Computer Use do not match how phones are operated, and the DOM that browser agents rely on does not exist for native mobile apps; platform accessibility trees, where an app exposes them, can be incomplete or missing for custom-rendered views. Driving the phone as raw pixel coordinates without a touch-shaped action vocabulary leaves the agent reasoning one tap at a time over coordinates, which is too low-level to plan with and brittle to changes in screen size, theme, and locale.

Solution

Define a touch-native action vocabulary (tap(x,y), long_press(x,y), swipe(dir), type(text), back, home). The agent receives a screenshot (optionally with extracted UI element annotations), reasons in text about which element to act on, emits an action call, and observes the next screenshot. Specialise the action vocabulary per platform (Android vs iOS) but keep the agent loop platform-agnostic.
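The loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the Device protocol, the policy callable (a stand-in for the model call that reads the screenshot and picks an action), the Done stop signal, and the swipe geometry are all hypothetical names and choices introduced here. The Android specialisation is shown as a function that renders each action as an `adb shell input` argument list without executing it; "swipe up" is assumed to mean the finger moves towards the top of the screen.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Union

# Touch-native vocabulary from the text. Coordinates are screen pixels.
@dataclass(frozen=True)
class Tap:
    x: int
    y: int

@dataclass(frozen=True)
class LongPress:
    x: int
    y: int

@dataclass(frozen=True)
class Swipe:
    direction: str  # assumed to be one of "up", "down", "left", "right"

@dataclass(frozen=True)
class TypeText:
    text: str

@dataclass(frozen=True)
class Back:
    pass

@dataclass(frozen=True)
class Home:
    pass

@dataclass(frozen=True)
class Done:
    """Hypothetical stop signal; not part of the source vocabulary."""

Action = Union[Tap, LongPress, Swipe, TypeText, Back, Home, Done]

class Device(Protocol):
    """Hypothetical platform backend (Android, iOS, or a test fake)."""
    def screenshot(self) -> bytes: ...
    def perform(self, action: Action) -> None: ...

# Stand-in for the model: takes the screenshot and history, returns one action.
Policy = Callable[[bytes, List[Action]], Action]

def run_agent(device: Device, policy: Policy, max_steps: int = 20) -> List[Action]:
    """Platform-agnostic loop: observe, decide, act, observe again."""
    history: List[Action] = []
    for _ in range(max_steps):
        shot = device.screenshot()
        action = policy(shot, history)
        if isinstance(action, Done):
            break
        device.perform(action)
        history.append(action)
    return history

_DIRECTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def to_adb_args(action: Action, width: int, height: int,
                hold_ms: int = 800, swipe_ms: int = 300) -> List[str]:
    """Android specialisation: render one action as an `adb shell input` argv."""
    base = ["adb", "shell", "input"]
    if isinstance(action, Tap):
        return base + ["tap", str(action.x), str(action.y)]
    if isinstance(action, LongPress):
        # A zero-distance swipe held for hold_ms acts as a long press.
        p = [str(action.x), str(action.y)]
        return base + ["swipe"] + p + p + [str(hold_ms)]
    if isinstance(action, Swipe):
        dx, dy = _DIRECTIONS[action.direction]
        # Assumed geometry: a centred stroke spanning half the screen.
        cx, cy, sx, sy = width // 2, height // 2, width // 4, height // 4
        coords = [cx - dx * sx, cy - dy * sy, cx + dx * sx, cy + dy * sy]
        return base + ["swipe"] + [str(c) for c in coords] + [str(swipe_ms)]
    if isinstance(action, TypeText):
        # `input text` does not accept literal spaces; %s stands in for them.
        return base + ["text", action.text.replace(" ", "%s")]
    if isinstance(action, Back):
        return base + ["keyevent", "KEYCODE_BACK"]
    if isinstance(action, Home):
        return base + ["keyevent", "KEYCODE_HOME"]
    raise ValueError(f"no device command for {action!r}")
```

Only to_adb_args knows about Android; an iOS backend would supply its own translation behind the same Device interface, so run_agent and the policy stay unchanged across platforms.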

When to use

  • The target environment is a smartphone where touch is the only useful input.
  • Desktop Computer Use or Browser Agent action sets are the wrong shape for the task.
  • A small touch-native vocabulary (tap, swipe, type, back, home) covers the workflow.
