Full-Code · Browser & Computer-Use · active

AppAgent

Type: full-code · Vendor: Tencent (TencentQQGYLab) · Language: Python · License: MIT · Status: active · Status in practice: emerging · First released: 2023-12-21

AppAgent is a multimodal agent framework that operates smartphone applications by observing the screen and performing human-like taps and swipes, after first exploring each app to build a per-element documentation base.

Description. AppAgent is an LLM-based multimodal agent framework designed to operate smartphone applications. It first runs an exploration phase, either exploring an app autonomously or learning from a human demonstration, and generates per-element documentation that it saves for later use. In the deployment phase it drives the app through a simplified action space that mimics human interactions such as tapping and swiping, using a vision-language model to read labeled screenshots. The approach bypasses the need for system back-end access, so it applies across diverse apps without per-app APIs.

Agent loop shape. In the exploration phase the agent either explores an app on its own or learns from a human demonstration, generating documentation for the elements it interacts with and saving it. In the deployment phase it observes labeled screenshots with a vision-language model, selects an action from a simplified action space, and performs human-like interactions such as tapping and swiping, consulting the documentation base it built earlier.
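
The two phases above can be sketched as a small loop. This is an illustrative outline only, not AppAgent's actual code: the names `Element`, `Action`, `explore`, `deploy`, and `demo_policy` are hypothetical, a plain Python function stands in for the vision-language model, and a pre-recorded list of screens stands in for a real device. Exploration is simplified to documenting every element on every recorded screen.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Element:
    label: int  # numeric tag shown on the screenshot
    uid: str    # hypothetical stable identifier for the element


@dataclass(frozen=True)
class Action:
    kind: str            # "tap", "swipe", or "finish"
    label: int = 0       # numeric label of the target element
    direction: str = ""  # used by "swipe" only


# A policy stands in for the vision-language model (assumption).
Policy = Callable[[str, List[Element], Dict[str, str]], Action]


def explore(screens: List[List[Element]],
            describe: Callable[[Element], str]) -> Dict[str, str]:
    """Exploration phase: write one note per element encountered."""
    docs: Dict[str, str] = {}
    for elements in screens:
        for el in elements:
            docs.setdefault(el.uid, describe(el))
    return docs


def deploy(task: str, screens: List[List[Element]], policy: Policy,
           docs: Dict[str, str], max_steps: int = 10) -> List[Action]:
    """Deployment phase: observe a screen, choose an action, record it."""
    log: List[Action] = []
    for elements in screens[:max_steps]:
        # Only pass the notes for elements visible on this screen.
        known = {el.uid: docs[el.uid] for el in elements if el.uid in docs}
        action = policy(task, elements, known)
        log.append(action)
        if action.kind == "finish":
            break
    return log


def demo_policy(task: str, elements: List[Element],
                known: Dict[str, str]) -> Action:
    """Toy rule: tap the first element whose note mentions the task word."""
    for el in elements:
        if task in known.get(el.uid, ""):
            return Action("tap", el.label)
    return Action("finish")


screens = [
    [Element(1, "btn_search"), Element(2, "btn_menu")],
    [Element(1, "btn_back")],
]
docs = explore(screens, lambda el: f"opens {el.uid.split('_')[1]}")
log = deploy("search", screens, demo_policy, docs)
```

In this toy run the policy taps label 1 on the first screen, finds nothing relevant on the second, and finishes. A real agent would replace `demo_policy` with a model call and the recorded screens with live screenshots.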

Primary use cases

  • operating smartphone apps without backend APIs
  • exploration-driven documentation of opaque app UIs
  • GUI task automation on Android through taps and swipes
  • vision-language control of mobile interfaces

Key concepts

  • Exploration phase (app-exploration-phase): The first stage, in which the agent autonomously explores an app or learns from a human demonstration and writes documentation for each UI element it interacts with, building the knowledge base used later.
  • Deployment phase (mobile-ui-agent): The execution stage, in which the agent is given a task and a chosen documentation base, then carries out the task by reading labeled screenshots and acting on them.
  • Numeric element labels (computer-use): Numeric tags overlaid on every interactive element in each captured screenshot, giving the vision-language model a stable handle to reference the element it wants to tap or swipe.
  • Element documentation base (procedural-memory): The saved per-element notes produced during exploration that the agent loads in the deployment phase, so prior experience with an app's controls carries into task execution.
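
As a rough illustration of the last two concepts, the sketch below numbers a list of element bounding boxes, resolves a chosen label to a tap coordinate, and stores per-element notes as JSON. It assumes nothing about AppAgent's real file layout or function names: `assign_labels`, `tap_point`, `save_docs`, `load_docs`, the JSON format, and the sample boxes are all hypothetical.

```python
import json
import os
import tempfile
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom in pixels


def assign_labels(boxes: List[Box]) -> Dict[int, Box]:
    """Number the interactive elements from 1, in the order given."""
    return {i: box for i, box in enumerate(boxes, start=1)}


def tap_point(labels: Dict[int, Box], label: int) -> Tuple[int, int]:
    """Turn a numeric label chosen by the model into a tap coordinate."""
    left, top, right, bottom = labels[label]
    return ((left + right) // 2, (top + bottom) // 2)


def save_docs(path: str, docs: Dict[str, str]) -> None:
    """Persist the per-element notes written during exploration."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(docs, f, indent=2, sort_keys=True)


def load_docs(path: str) -> Dict[str, str]:
    """Load a saved documentation base for the deployment phase."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Two made-up elements stacked vertically on a screen.
labels = assign_labels([(0, 0, 100, 50), (0, 60, 100, 120)])
point = tap_point(labels, 2)  # centre of the second element

# Round-trip a one-entry documentation base through a temporary file.
docs_path = os.path.join(tempfile.mkdtemp(), "docs.json")
save_docs(docs_path, {"btn_search": "Tapping this opens the search page."})
loaded = load_docs(docs_path)
```

Keeping the label-to-box mapping separate from the notes means the model only has to name a number, while the saved notes can be reused across sessions with the same app.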
