Framework · Browser & Computer-Use

Mobile-Agent / GUI-Owl

Cross-platform multi-agent GUI automation framework (mobile / desktop / browser) built on the GUI-Owl native VLM family, with planning, progress management, reflection, and memory as distinct cooperating agents.

Description

Mobile-Agent is X-PLUG / Alibaba Qwen's GUI agent family. The v3 release (2025) made it cross-platform (mobile, desktop, and browser) and moved from the single-agent design of v1 and the mobile-only multi-agent design of v2 to a multi-agent framework built on the GUI-Owl native VLM models (2B/4B/8B/32B/235B, built on Qwen3-VL). The framework decomposes a GUI task into planning, progress management, reflection, and memory roles, each backed by a specialised model call, and orchestrates them around grounded screenshot + DOM observations. It won the CCL 2025 best demo award. It differs from the browser-first computer-use systems common in the West (Anthropic computer-use, OpenAI Operator) in being mobile-first and in treating cross-platform grounding as a foundational model concern (GUI-Owl) rather than an integration layer.

Solution

The multi-agent loop runs on every GUI step. A Planner emits the next high-level action given the task and the history so far. A Grounder (the GUI-Owl VLM) localises the action target on the current screenshot. An Executor performs the action (tap, swipe, type, scroll). A Reflector reads the post-action screenshot to verify success and writes the outcome to Memory. On later turns, the Planner consumes Memory plus the current state to choose the next action.
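
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Mobile-Agent codebase: the names Action, Memory, run_task, and demo, and the callable signatures for each role, are all hypothetical. Each role is passed in as a plain function, where a real system would wrap a model call or a device driver.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Action:
    kind: str          # "tap", "swipe", "type", "scroll", or "done"
    target: str = ""   # natural-language description of the UI element


@dataclass
class Memory:
    notes: List[str] = field(default_factory=list)

    def write(self, note: str) -> None:
        self.notes.append(note)


Point = Tuple[int, int]


def run_task(
    task: str,
    planner: Callable[[str, Memory, bytes], Action],
    grounder: Callable[[Action, bytes], Point],
    executor: Callable[[Action, Point], None],
    reflector: Callable[[Action, bytes, bytes], bool],
    screenshot: Callable[[], bytes],
    max_steps: int = 10,
) -> Memory:
    """Hypothetical plan -> ground -> execute -> reflect loop."""
    memory = Memory()
    for step in range(max_steps):
        before = screenshot()
        # Planner sees the task, accumulated memory, and the current screen.
        action = planner(task, memory, before)
        if action.kind == "done":
            memory.write(f"step {step}: task reported complete")
            break
        # Grounder maps the described target to screen coordinates.
        point = grounder(action, before)
        executor(action, point)
        # Reflector compares pre- and post-action screens to judge success.
        after = screenshot()
        status = "ok" if reflector(action, before, after) else "failed"
        memory.write(
            f"step {step}: {action.kind} '{action.target}' at {point} -> {status}"
        )
    return memory


def demo() -> Memory:
    """Run the loop against stub roles that simulate a two-screen app."""
    screen = {"state": b"home"}

    def planner(task: str, memory: Memory, shot: bytes) -> Action:
        if shot == b"settings":
            return Action("done")
        return Action("tap", target="Settings icon")

    def grounder(action: Action, shot: bytes) -> Point:
        return (120, 640)  # a fixed stand-in for a VLM grounding call

    def executor(action: Action, point: Point) -> None:
        screen["state"] = b"settings"  # pretend the tap opened Settings

    def reflector(action: Action, before: bytes, after: bytes) -> bool:
        return before != after  # naive check: did the screen change?

    return run_task("Open settings", planner, grounder, executor,
                    reflector, lambda: screen["state"])


if __name__ == "__main__":
    for note in demo().notes:
        print(note)
```

Passing the roles in as callables keeps the loop independent of any particular model or device backend, so a stub (as in demo) can stand in for the grounder or the reflector during testing.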

Primary use cases

  • mobile UI automation on Android with native screenshot + tap grounding
  • cross-platform GUI tasks spanning mobile, desktop, and browser in one framework
  • research baseline for VLM-as-grounder for GUI actions
  • evaluation against 20+ GUI benchmarks (state of the art on most, according to the published numbers)
