Full-Code · Browser & Computer-Use · active

Mobile-Agent / GUI-Owl

Type: full-code  ·  Vendor: Alibaba Qwen / X-PLUG  ·  Language: Python  ·  License: Apache-2.0  ·  Status: active  ·  Status in practice: emerging

Cross-platform multi-agent GUI automation framework (mobile / desktop / browser) built on the GUI-Owl native VLM family, with planning, progress management, reflection, and memory as distinct cooperating agents.

Description. Mobile-Agent is X-PLUG / Alibaba Qwen's GUI agent family. The v3 (2025) release became cross-platform (mobile, desktop, and browser) and moved from the single-agent design of v1 and the mobile-only multi-agent design of v2 to a multi-agent framework over the GUI-Owl native VLM models (2B/4B/8B/32B/235B, built on Qwen3-VL). The framework decomposes a GUI task into planning, progress management, reflection, and memory roles, each backed by a specialised model call, and orchestrates them around grounded screenshot + DOM observation. The project won the CCL 2025 best demo award. It differs from the browser-first bias of Western computer-use systems (Anthropic computer-use, OpenAI Operator) in being mobile-first and in treating cross-platform grounding as a foundational model concern (GUI-Owl) rather than an integration layer.

Agent loop shape. Multi-agent loop on every GUI step. A Planner emits the next high-level action given task and history. A Grounder (the GUI-Owl VLM) localises the action target on the current screenshot. An Executor performs the action (tap, swipe, type, scroll). A Reflector reads the post-action screenshot to verify success and writes to Memory. Across turns, the Planner consumes Memory plus current state to choose the next action.
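
The per-step loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Mobile-Agent's actual API: `Action`, `Memory`, `run_episode`, and the five callables (`capture`, `planner`, `grounder`, `executor`, `reflector`) are hypothetical names, and the model calls are replaced by stub functions so the control flow can be run without a device or a VLM.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple

Screenshot = Any  # hypothetical stand-in for an image or screen capture


@dataclass
class Action:
    kind: str                               # "tap", "swipe", "type", "scroll", or "done"
    target: str = ""                        # natural-language description of the UI element
    point: Optional[Tuple[int, int]] = None # screen coordinates, filled in by the grounder


@dataclass
class Memory:
    notes: List[str] = field(default_factory=list)


def run_episode(
    task: str,
    capture: Callable[[], Screenshot],
    planner: Callable[[str, List[str], Screenshot], Action],
    grounder: Callable[[Action, Screenshot], Tuple[int, int]],
    executor: Callable[[Action], None],
    reflector: Callable[[Action, Screenshot, Screenshot], bool],
    max_steps: int = 20,
) -> Memory:
    """Run the plan -> ground -> execute -> reflect loop until 'done' or max_steps."""
    memory = Memory()
    for step in range(max_steps):
        before = capture()
        # Planner: choose the next high-level action from task, memory, and current state.
        action = planner(task, memory.notes, before)
        if action.kind == "done":
            break
        # Grounder: localise the action target on the current screenshot.
        action.point = grounder(action, before)
        # Executor: perform the action on the device.
        executor(action)
        # Reflector: compare pre- and post-action state, then record the outcome in memory.
        after = capture()
        ok = reflector(action, before, after)
        outcome = "ok" if ok else "failed"
        memory.notes.append(
            f"step {step}: {action.kind} {action.target!r} at {action.point} -> {outcome}"
        )
    return memory


def demo() -> List[str]:
    """Drive the loop with stubs: a fake two-screen 'device' and rule-based roles."""
    screen = {"state": "home"}

    def capture() -> Screenshot:
        return screen["state"]

    def planner(task: str, notes: List[str], shot: Screenshot) -> Action:
        # Stub policy: tap once from the home screen, then declare the task done.
        if shot == "home":
            return Action("tap", target="Settings icon")
        return Action("done")

    def grounder(action: Action, shot: Screenshot) -> Tuple[int, int]:
        return (120, 340)  # a real grounder would be a VLM call returning coordinates

    def executor(action: Action) -> None:
        screen["state"] = "settings"  # a real executor would drive the device

    def reflector(action: Action, before: Screenshot, after: Screenshot) -> bool:
        return before != after  # stub check: treat any screen change as success

    return run_episode("open settings", capture, planner, grounder, executor, reflector).notes
```

Calling `demo()` runs one tap step and then stops when the stub planner returns a `done` action. Passing the roles as plain callables keeps the loop independent of how each role is implemented, so any of them could be swapped for a model call.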

Primary use cases

  • mobile UI automation on Android with native screenshot + tap grounding
  • cross-platform GUI tasks spanning mobile, desktop, and browser in one framework
  • research baseline for VLM-as-grounder for GUI actions
  • evaluation against 20+ GUI benchmarks (state of the art on most, according to the project's published numbers)

Key concepts

  • GUI-Owl native VLM family: Foundation VLM (Qwen3-VL based, 2B-235B) trained natively for GUI grounding and action.
  • Planner / Grounder / Executor / Reflector (hierarchical-agents): Four-role decomposition per GUI step.
  • Cross-platform unified abstraction: Single framework spans mobile, desktop, and browser environments.
  • Progress + memory roles (episodic-memory): Dedicated agents track multi-step progress and persist relevant observations.
