Full-Code · Browser & Computer-Use · active

Mobile-Agent / GUI-Owl

Type: full-code · Vendor: Alibaba Qwen / X-PLUG · Language: Python · License: Apache-2.0 · Status: active · Status in practice: emerging

Cross-platform multi-agent GUI automation framework (mobile / desktop / browser) built on the GUI-Owl native VLM family, with planning, progress management, reflection, and memory as distinct cooperating agents.

Description. Mobile-Agent is X-PLUG / Alibaba Qwen's GUI agent family. The v3 release (2025) made it cross-platform (mobile, desktop, and browser) and moved from the single-agent design of v1 and the mobile-only multi-agent design of v2 to a multi-agent framework built on the GUI-Owl native VLM models (2B/4B/8B/32B/235B, built on Qwen3-VL). The framework decomposes a GUI task into planning, progress management, reflection, and memory roles, each backed by a specialised model call, and orchestrates them around grounded screenshot + DOM observation. It won the CCL 2025 best demo award. It differs from the browser-first Western computer-use systems (Anthropic computer-use, OpenAI Operator) in being mobile-first and in treating cross-platform grounding as a foundational model concern (GUI-Owl) rather than an integration layer.

Agent loop shape. A multi-agent loop runs on every GUI step. A Planner emits the next high-level action given the task and history. A Grounder (the GUI-Owl VLM) localises the action target on the current screenshot. An Executor performs the action (tap, swipe, type, scroll). A Reflector reads the post-action screenshot to verify success and writes the outcome to Memory. On later turns, the Planner consumes Memory plus the current state to choose the next action.
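
The loop described above can be sketched in a few lines of Python. This is a minimal illustration of the control flow only, not the Mobile-Agent API: `run_task`, `Step`, `Memory`, and the five role callables are hypothetical names, and the demo replaces every model call with a deterministic stub.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical types for illustration; none of these names come from Mobile-Agent.

@dataclass
class Step:
    action: str               # high-level action chosen by the Planner
    target: Tuple[int, int]   # screen coordinates chosen by the Grounder
    success: bool             # verdict written by the Reflector

@dataclass
class Memory:
    steps: List[Step] = field(default_factory=list)

def run_task(
    task: str,
    planner: Callable[[str, Memory, dict], Optional[str]],
    grounder: Callable[[str, dict], Tuple[int, int]],
    executor: Callable[[str, Tuple[int, int]], None],
    reflector: Callable[[str, dict, dict], bool],
    screenshot: Callable[[], dict],
    max_steps: int = 20,
) -> Memory:
    """Run Planner -> Grounder -> Executor -> Reflector until the Planner stops."""
    memory = Memory()
    for _ in range(max_steps):
        before = screenshot()
        action = planner(task, memory, before)   # None means the task is done
        if action is None:
            break
        target = grounder(action, before)        # localise on the current screen
        executor(action, target)                 # tap / swipe / type / scroll
        after = screenshot()
        success = reflector(action, before, after)
        memory.steps.append(Step(action, target, success))
    return memory

def demo() -> Memory:
    """Stubbed roles: the 'screen' is a dict that counts taps."""
    screen = {"taps": 0}
    plan = ["open settings", "tap wifi"]

    def planner(task, memory, shot):
        # Only successful steps advance the plan, so a failed step is retried.
        done = sum(1 for s in memory.steps if s.success)
        return plan[done] if done < len(plan) else None

    def grounder(action, shot):
        return (100, 200 + 50 * shot["taps"])    # fake coordinates

    def executor(action, target):
        screen["taps"] += 1

    def reflector(action, before, after):
        return after["taps"] == before["taps"] + 1

    return run_task("enable wifi", planner, grounder, executor,
                    reflector, lambda: dict(screen))
```

Calling `demo()` returns a `Memory` holding one `Step` per executed action. Because the stub Planner counts only successful steps, a step the Reflector rejects is planned again on the next turn, which mirrors the failure path back to the Planner.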

Primary use cases

mobile UI automation on Android with native screenshot + tap grounding
cross-platform GUI tasks spanning mobile, desktop, and browser in one framework
research baseline for VLM-as-grounder for GUI actions
evaluation against 20+ GUI benchmarks (state-of-the-art on most per published numbers)

flowchart TD
    TASK[GUI task] --> PLANNER[Planner agent next high-level action]
    PLANNER --> GROUNDER[Grounder agent GUI-Owl VLM]
    GROUNDER --> LOC[Localise target on current screenshot]
    LOC --> EXECUTOR[Executor agent tap / swipe / type / scroll]
    EXECUTOR --> ENV[(Mobile / desktop / browser cross-platform)]
    ENV --> SHOT[Post-action screenshot]
    SHOT --> REFLECTOR[Reflector agent verify success]
    REFLECTOR --> MEM[(Memory observations + outcomes)]
    MEM --> PLANNER
    REFLECTOR -->|fail| PLANNER
    REFLECTOR -->|done| RESULT[Task complete]

Key concepts

GUI-Owl native VLM family (docs) — Foundation VLM (Qwen3-VL based, 2B-235B) trained natively for GUI grounding and action.
Planner / Grounder / Executor / Reflector → hierarchical-agents — Four-role decomposition per GUI step.
Cross-platform unified abstraction — Single framework spans mobile, desktop, browser environments.
Progress + memory roles → episodic-memory — Dedicated agents track multi-step progress and persist relevant observations.
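
One way to picture the cross-platform abstraction listed above is a single action type dispatched to interchangeable environment backends. The sketch below is an assumption about how such a layer could look, not a description of how Mobile-Agent implements it: `Action`, `Environment`, `FakeEnv`, and `run_everywhere` are hypothetical names, and the backend only records calls instead of driving a real device or browser.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple

# Hypothetical abstraction for illustration; not taken from the Mobile-Agent code.

@dataclass(frozen=True)
class Action:
    kind: str                       # "tap", "swipe", "type", or "scroll"
    point: Tuple[int, int] = (0, 0)
    text: str = ""

class Environment(Protocol):
    """Interface the agent loop relies on, whatever the platform."""
    def screenshot(self) -> bytes: ...
    def perform(self, action: Action) -> None: ...

class FakeEnv:
    """In-memory stand-in; a real backend would drive a device or browser."""

    def __init__(self, platform: str) -> None:
        self.platform = platform
        self.log: List[str] = []

    def screenshot(self) -> bytes:
        return b""                  # placeholder for image bytes

    def perform(self, action: Action) -> None:
        if action.kind not in {"tap", "swipe", "type", "scroll"}:
            raise ValueError(f"unsupported action: {action.kind}")
        self.log.append(f"{self.platform}:{action.kind}@{action.point}")

def run_everywhere(envs: List[Environment], action: Action) -> None:
    """Send the same platform-neutral action to every environment."""
    for env in envs:
        env.perform(action)
```

For example, `run_everywhere([FakeEnv("mobile"), FakeEnv("browser")], Action("tap", (10, 20)))` appends one log entry to each environment. The agent roles then depend only on the `Environment` interface, so platform-specific code stays inside the backends.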
