Mobile UI Agent
also known as Smartphone Agent, Mobile App Agent, Touch-UI Agent
Drive a smartphone end-to-end through a small, touch-native action vocabulary (tap, long-press, swipe, type, back, home) over screenshots, treating the phone as an interaction surface distinct from desktop Computer Use and from web Browser Agents.
Context
A team needs an agent to operate a mobile app on a real or emulated phone: a ride-hailing app, a food delivery app, a banking app, a Chinese super-app. The app exposes no public API and no clean web frontend that mirrors its functionality, so the only surface available is the touch user interface itself.
Problem
Mouse-and-keyboard action sets borrowed from desktop Computer Use do not match how phones are operated, and the DOM that browser agents rely on has no counterpart in native mobile apps, whose accessibility metadata is often incomplete or missing. Driving the phone as raw pixel coordinates, without a touch-shaped action vocabulary, leaves the agent reasoning one tap at a time over coordinates, which is too low-level to plan with and brittle under changes in screen size, theme, and locale.
Forces
- Mobile actions are touch-native, gesture-based, and screen-coordinate dependent.
- Per-app APIs do not exist; only the UI is available.
- Screen size is small; what fits on one screen does not generalise.
- Visual state is the source of truth, but text is what the model reasons in.
Example
A team tries to reuse their desktop computer-use agent on Android by injecting mouse-and-keyboard actions through ADB. The agent fights the touch interface, mistakes long-press menus for hover tooltips, and cannot find the back button. They rebuild it as a Mobile UI Agent: a touch-native action vocabulary (tap, long-press, swipe, type, back, home), screenshots with extracted UI element annotations, and a model that reasons about which element to act on instead of which pixel. The rebuilt agent completes mobile flows such as food ordering and ride-booking end to end.
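To make "which element instead of which pixel" concrete, here is a minimal sketch of how extracted annotations might be shown to the model and then resolved back to a tap point. The `Element` shape, the `describe` and `tap_point` helpers, and the sample screen are all hypothetical; a real system would fill them from whatever UI extraction it uses.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Element:
    """One annotated UI element (hypothetical shape, not tied to any tool)."""
    id: int
    label: str
    bounds: Tuple[int, int, int, int]  # left, top, right, bottom in pixels


def describe(elements: List[Element]) -> str:
    """Render the annotations as text the model can reason over."""
    return "\n".join(f"[{e.id}] {e.label}" for e in elements)


def tap_point(elements: List[Element], element_id: int) -> Tuple[int, int]:
    """Resolve a chosen element id to the centre of its bounding box."""
    for e in elements:
        if e.id == element_id:
            left, top, right, bottom = e.bounds
            return ((left + right) // 2, (top + bottom) // 2)
    raise KeyError(f"no element with id {element_id}")


# Illustrative screen contents; the labels and coordinates are made up.
screen = [
    Element(1, "Search restaurants", (40, 120, 1040, 200)),
    Element(2, "Order now", (100, 200, 300, 260)),
]
```

The model sees only the output of `describe(screen)` alongside the screenshot and answers with an element id; the driver turns that id into coordinates, so a change in screen size or theme moves the bounds without changing the model's reasoning.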
Solution
Therefore:
Define a touch-native action vocabulary (tap(x,y), long_press(x,y), swipe(dir), type(text), back, home). The agent receives a screenshot (optionally with extracted UI element annotations), reasons in text about which element to act on, emits an action call, and observes the next screenshot. Specialise the action vocabulary per platform (Android vs iOS) but keep the agent loop platform-agnostic.
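The loop described above can be sketched as follows. This is one possible shape under stated assumptions, not a reference implementation: the class names, the `Device` protocol, the `Policy` signature, and the convention that the policy returns `None` when it considers the task finished are all illustrative choices that the pattern itself does not prescribe.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Protocol, Union


@dataclass(frozen=True)
class Tap:
    x: int
    y: int


@dataclass(frozen=True)
class LongPress:
    x: int
    y: int


@dataclass(frozen=True)
class Swipe:
    direction: str  # assumed values: "up", "down", "left", "right"


@dataclass(frozen=True)
class TypeText:
    text: str


@dataclass(frozen=True)
class Back:
    pass


@dataclass(frozen=True)
class Home:
    pass


Action = Union[Tap, LongPress, Swipe, TypeText, Back, Home]


class Device(Protocol):
    """Platform-specific driver (Android or iOS); the loop only sees this."""
    def screenshot(self) -> bytes: ...
    def perform(self, action: Action) -> None: ...


# Hypothetical policy: (goal, screenshot, history) -> next action, or None when done.
Policy = Callable[[str, bytes, List[Action]], Optional[Action]]


def run_episode(device: Device, policy: Policy, goal: str,
                max_steps: int = 20) -> List[Action]:
    """Observe, decide, act, repeat; stop on None or after max_steps."""
    history: List[Action] = []
    for _ in range(max_steps):
        shot = device.screenshot()
        action = policy(goal, shot, history)
        if action is None:
            break
        device.perform(action)
        history.append(action)
    return history
```

Only the `Device` implementation changes between Android and iOS; `run_episode` and the action types stay the same, which is the platform-agnostic split the pattern asks for.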
What this pattern forbids. The agent may only emit actions in the registered touch-action vocabulary; arbitrary system or shell access is forbidden by construction.
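One way to approximate "forbidden by construction" is to have the executor accept only a closed registry of action names and argument shapes, rejecting everything else before it reaches the device. The sketch below assumes a JSON wire format of the form `{"action": "tap", "x": 10, "y": 20}` and a fixed set of swipe directions; both are illustrative assumptions, as are the names `REGISTRY`, `ActionRejected`, and `parse_action`.

```python
import json
from typing import Any, Dict

# Closed registry: action name -> required argument names and types.
REGISTRY: Dict[str, Dict[str, type]] = {
    "tap": {"x": int, "y": int},
    "long_press": {"x": int, "y": int},
    "swipe": {"dir": str},
    "type": {"text": str},
    "back": {},
    "home": {},
}
SWIPE_DIRS = {"up", "down", "left", "right"}  # assumed direction names


class ActionRejected(ValueError):
    """Raised for any model output outside the registered vocabulary."""


def parse_action(raw: str) -> Dict[str, Any]:
    """Parse one model output and return it only if it is a registered action."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ActionRejected("output is not valid JSON") from exc
    if not isinstance(obj, dict):
        raise ActionRejected("output is not a JSON object")
    name = obj.get("action")
    if not isinstance(name, str) or name not in REGISTRY:
        raise ActionRejected(f"unregistered action: {name!r}")
    spec = REGISTRY[name]
    args = {k: v for k, v in obj.items() if k != "action"}
    if set(args) != set(spec):
        raise ActionRejected(f"{name} expects arguments {sorted(spec)}")
    for key, expected in spec.items():
        # Exact type check so that JSON true/false is not accepted as an int.
        if type(args[key]) is not expected:
            raise ActionRejected(f"{name}.{key} must be {expected.__name__}")
    if name == "swipe" and args["dir"] not in SWIPE_DIRS:
        raise ActionRejected(f"unknown swipe direction: {args['dir']!r}")
    return obj
```

Because the registry has no entry for shell or system commands, an output such as `{"action": "shell", "cmd": "..."}` raises `ActionRejected` and never reaches the driver.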
The smaller patterns that complete this one:
- uses Structured Output ★★: Constrain the model's output to conform to a JSON Schema (or similar typed shape).
And the patterns that stand alongside it, or against it:
- alternative-to Computer Use ★: Let the model drive a desktop end-to-end via screenshots plus virtual mouse/keyboard tool calls instead of bespoke per-app APIs.
- alternative-to Browser Agent ★: Expose websites to the agent through a structured DOM/accessibility tree plus a small action vocabulary, sitting between raw HTML and pixel-level Computer Use.
- complements App Exploration Phase ·: Before deploying an agent against an opaque app, have it explore (or watch a human demonstrate) the app, generating a per-element documentation knowledge base; at deployment, retrieve element docs to ground actions.
- complements Dual-System GUI Agent ★: Split a GUI agent into a decision model that plans and recovers from errors and a grounding model that observes pixels and emits the precise action; route each subproblem to the better-suited model.