Mobile-Agent / GUI-Owl
Type: full-code · Vendor: Alibaba Qwen / X-PLUG · Language: Python · License: Apache-2.0 · Status: active · Status in practice: emerging
Cross-platform multi-agent GUI automation framework (mobile / desktop / browser) built on the GUI-Owl native VLM family, with planning, progress management, reflection, and memory as distinct cooperating agents.
Description. Mobile-Agent is X-PLUG / Alibaba Qwen's GUI agent family. The v3 release (2025) made it cross-platform, covering mobile, desktop, and browser, and moved from the single-agent design of v1 and the mobile-only multi-agent design of v2 to a multi-agent framework over the GUI-Owl native VLM models (2B/4B/8B/32B/235B, built on Qwen3-VL). The framework decomposes a GUI task into planning, progress management, reflection, and memory roles, each backed by a specialised model call, and orchestrates them around grounded screenshot + DOM observation. The project won the CCL 2025 best demo award. It differs from the browser-first bias of Western computer-use systems (Anthropic computer-use, OpenAI Operator) by being mobile-first and by treating cross-platform grounding as a foundational model concern (GUI-Owl) rather than an integration layer.
Agent loop shape. Multi-agent loop on every GUI step. A Planner emits the next high-level action given task and history. A Grounder (the GUI-Owl VLM) localises the action target on the current screenshot. An Executor performs the action (tap, swipe, type, scroll). A Reflector reads the post-action screenshot to verify success and writes to Memory. Across turns, the Planner consumes Memory plus current state to choose the next action.
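The loop above can be sketched in a few lines of Python. This is a minimal illustration of the control flow only, not the project's actual code: the names `Action`, `Memory`, `run_task`, and `demo`, and the callable signatures for each role, are assumptions made for this example, and the roles are plain stub functions rather than model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """Hypothetical action record; field names are illustrative."""
    kind: str                                  # "tap", "swipe", "type", "scroll", or "done"
    target: str = ""                           # natural-language description of the target
    point: Optional[Tuple[int, int]] = None    # screen coordinates, filled in by the grounder


@dataclass
class Memory:
    """Hypothetical memory store: one text entry per executed step."""
    entries: List[str] = field(default_factory=list)


def run_task(task: str,
             planner: Callable, grounder: Callable,
             executor: Callable, reflector: Callable,
             screenshot: Callable, max_steps: int = 10) -> Memory:
    memory = Memory()
    for _ in range(max_steps):
        before = screenshot()
        # Planner: choose the next high-level action from task, memory, and current state.
        action = planner(task, memory, before)
        if action.kind == "done":
            break
        # Grounder: localise the action target on the current screenshot.
        action.point = grounder(action, before)
        # Executor: perform the action against the environment.
        executor(action)
        # Reflector: compare pre- and post-action screenshots, then record the outcome.
        ok = reflector(action, before, screenshot())
        memory.entries.append(f"{action.kind} {action.target}: {'ok' if ok else 'failed'}")
    return memory


def demo() -> List[str]:
    """Run the loop against a toy two-screen 'device' built from stubs."""
    state = {"screen": "home"}

    def screenshot():
        return state["screen"]

    def planner(task, memory, screen):
        return Action("tap", "Settings icon") if screen == "home" else Action("done")

    def grounder(action, screen):
        return (100, 200)  # a fixed point stands in for VLM grounding

    def executor(action):
        state["screen"] = "settings"

    def reflector(action, before, after):
        return before != after  # the screen changed, so treat the step as successful

    return run_task("open settings", planner, grounder, executor, reflector, screenshot).entries
```

Calling `demo()` returns `['tap Settings icon: ok']`: one step is planned, grounded, executed, and verified, and the second planner call ends the loop. In a real system each stub would be replaced by a model call or a device driver, and the Reflector's verdict would typically feed back into replanning rather than only being logged.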
Primary use cases
- mobile UI automation on Android with native screenshot + tap grounding
- cross-platform GUI tasks spanning mobile, desktop, and browser in one framework
- research baseline for VLM-as-grounder for GUI actions
- evaluation against 20+ GUI benchmarks (state-of-the-art on most, according to the published numbers)
Key concepts
- GUI-Owl native VLM family (docs) — Foundation VLM (Qwen3-VL based, 2B-235B) trained natively for GUI grounding and action.
- Planner / Grounder / Executor / Reflector → hierarchical-agents — Four-role decomposition per GUI step.
- Cross-platform unified abstraction — Single framework spans mobile, desktop, browser environments.
- Progress + memory roles → episodic-memory — Dedicated agents track multi-step progress and persist relevant observations.
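One way to picture the cross-platform unified abstraction is a single environment interface that every platform backend satisfies, so the agent roles never branch on platform. The sketch below is an assumption about how such an interface could look, not the framework's actual API: `GuiEnvironment`, `RecordingEnvironment`, and `search` are hypothetical names, and the backend only records calls instead of driving a real device.

```python
from typing import List, Protocol, Tuple


class GuiEnvironment(Protocol):
    """Hypothetical interface shared by mobile, desktop, and browser backends."""
    platform: str

    def screenshot(self) -> bytes: ...
    def tap(self, x: int, y: int) -> None: ...
    def type_text(self, text: str) -> None: ...


class RecordingEnvironment:
    """Stand-in backend that logs actions; a real one would drive a device or browser."""

    def __init__(self, platform: str) -> None:
        self.platform = platform
        self.log: List[str] = []

    def screenshot(self) -> bytes:
        return b""  # placeholder for real image bytes

    def tap(self, x: int, y: int) -> None:
        self.log.append(f"{self.platform}:tap({x},{y})")

    def type_text(self, text: str) -> None:
        self.log.append(f"{self.platform}:type({text!r})")


def search(env: GuiEnvironment, box: Tuple[int, int], query: str) -> None:
    """Platform-agnostic routine: focus a search box, then type a query."""
    env.tap(*box)
    env.type_text(query)


def run_on_all() -> List[List[str]]:
    """Run the same routine on two backends and return their action logs."""
    envs = [RecordingEnvironment("mobile"), RecordingEnvironment("desktop")]
    for env in envs:
        search(env, (10, 20), "weather")
    return [env.log for env in envs]
```

The same `search` routine runs unchanged against both backends; only the recorded platform prefix differs. The design choice illustrated here is that grounding returns coordinates in a common form, and platform-specific details stay behind the interface.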
Patterns this full-code implements
- ★Computer Use
- ★Dual-System GUI Agent
Planner over Grounder/Executor is a dual-system shape.
- ★★Hierarchical Agents
- ★★Orchestrator-Workers
Planner orchestrates Grounder/Executor/Reflector workers.
- ★★ReAct
Reflector closes the observation loop.
- ★★Reflection
Reflector is a named agent.
- ★★Tool Use
- ★★Structured Output
- ★★Plan-and-Execute
- ★Browser Agent
- ★★Episodic Memory
Memory role tracks observations and outcomes.
- ★★Vector Memory