AppAgent
Type: full-code · Vendor: Tencent (TencentQQGYLab) · Language: Python · License: MIT · Status: active · Status in practice: emerging · First released: 2023-12-21
AppAgent is a multimodal agent framework that operates smartphone applications by observing the screen and performing human-like taps and swipes, after first exploring each app to build a per-element documentation base.
Description. AppAgent is an LLM-based multimodal agent framework designed to operate smartphone applications. It first runs an exploration phase, either exploring an app autonomously or learning from a human demonstration, and generates per-element documentation that it saves for later use. In the deployment phase it drives the app through a simplified action space that mimics human interactions such as tapping and swiping, using a vision-language model to read labeled screenshots. The approach bypasses the need for system back-end access, so it applies across diverse apps without per-app APIs.
Agent loop shape. AppAgent runs in two phases. In the exploration phase the agent either explores an app on its own or learns from a human demonstration, writing and saving documentation for each element it interacts with. In the deployment phase each step captures a screenshot with numerically labeled interactive elements, has the vision-language model select an action from the simplified action space (such as a tap or a swipe) while consulting the saved documentation base, and performs that action on the device.
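The two phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not AppAgent's actual code: `DocBase`, `explore`, `deploy`, and the `describe` and `choose` callables are hypothetical names, screens are reduced to lists of element ids, and the vision-language model is stood in for by a plain function.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration; none of these names come from AppAgent.
Action = Tuple[str, int]  # (kind, numeric element label), e.g. ("tap", 2)


@dataclass
class DocBase:
    """Per-element notes keyed by an element id."""
    notes: Dict[str, str] = field(default_factory=dict)


def explore(elements: List[str], describe: Callable[[str], str]) -> DocBase:
    """Exploration phase: record a note for every element interacted with."""
    docs = DocBase()
    for element_id in elements:
        docs.notes[element_id] = describe(element_id)
    return docs


def deploy(
    task: str,
    screens: List[List[str]],
    docs: DocBase,
    choose: Callable[[str, Dict[int, str], Dict[int, str]], Action],
) -> List[Action]:
    """Deployment phase: one observe-select-act step per screen."""
    history: List[Action] = []
    for elements in screens:
        # Label each interactive element with a numeric tag, starting at 1.
        labeled = {n: el for n, el in enumerate(elements, start=1)}
        # Look up saved notes for the elements visible on this screen.
        hints = {n: docs.notes[el] for n, el in labeled.items() if el in docs.notes}
        # `choose` stands in for the vision-language model (an assumption).
        action = choose(task, labeled, hints)
        history.append(action)
        if action[0] == "finish":
            break
    return history
```

Passing the model in as a callable keeps the sketch runnable without a device or an API key; in a real setup that callable would send the labeled screenshot and the matching notes to a vision-language model.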
Primary use cases
- operating smartphone apps without back-end APIs
- exploration-driven documentation of opaque app UIs
- GUI task automation on Android through taps and swipes
- vision-language control of mobile interfaces
Key concepts
- Exploration phase → app-exploration-phase (docs) — The first stage in which the agent autonomously explores an app or learns from a human demonstration and writes documentation for each UI element it interacts with, building the knowledge base used later.
- Deployment phase → mobile-ui-agent (docs) — The execution stage where the agent is given a task and a chosen documentation base, then carries out the task by reading labeled screenshots and acting on them.
- Numeric element labels → computer-use (docs) — Numeric tags overlaid on every interactive element in each captured screenshot, giving the vision-language model a stable handle to reference the element it wants to tap or swipe.
- Element documentation base → procedural-memory (docs) — The saved per-element notes produced during exploration that the agent loads in the deployment phase, so prior experience with an app's controls carries into task execution.
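Two of the concepts above, numeric element labels and the element documentation base, lend themselves to a small sketch. The code below is illustrative only: `UIElement`, `label_elements`, `tap_point`, `save_docs`, and `load_docs` are hypothetical names, the bounding-box layout is assumed, and the one-JSON-file-per-element storage is a choice made for this example rather than AppAgent's documented on-disk format.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class UIElement:
    uid: str                         # hypothetical identifier for the element
    bbox: Tuple[int, int, int, int]  # assumed layout: (left, top, right, bottom) in pixels


def label_elements(elements: List[UIElement]) -> Dict[int, UIElement]:
    """Assign numeric tags 1..N in the order the elements were detected."""
    return {n: el for n, el in enumerate(elements, start=1)}


def tap_point(labeled: Dict[int, UIElement], label: int) -> Tuple[int, int]:
    """Resolve a numeric label chosen by the model to the centre of its box."""
    left, top, right, bottom = labeled[label].bbox
    return ((left + right) // 2, (top + bottom) // 2)


def save_docs(doc_dir: Path, notes: Dict[str, str]) -> None:
    """Persist one JSON record per element so a later run can reload it."""
    doc_dir.mkdir(parents=True, exist_ok=True)
    for uid, text in notes.items():
        record = {"uid": uid, "note": text}
        (doc_dir / f"{uid}.json").write_text(json.dumps(record), encoding="utf-8")


def load_docs(doc_dir: Path) -> Dict[str, str]:
    """Load every saved record back into a uid -> note mapping."""
    notes: Dict[str, str] = {}
    for path in sorted(doc_dir.glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        notes[record["uid"]] = record["note"]
    return notes
```

The numeric label is what the model refers to, while the coordinates stay on the host side: the model answers with a label, and `tap_point` turns that label into a screen position.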
Patterns this full-code implements
- ·App Exploration Phase
AppAgent runs an exploration phase (autonomous or learning from a human demo) that generates per-element documentation into a knowledge base, then a deployment phase that selects that documentation b…
- ★Mobile UI Agent
In the deployment phase AppAgent drives the phone through a simplified, touch-native action space (tap, swipe) over labeled screenshots, bypassing back-end APIs so it generalizes across apps.
- ★Computer Use
AppAgent perceives the device the way a user does: each step captures a screenshot, labels every interactive element with a numeric tag, and a vision-language model picks the next action over that an…
- ★★Reflection
During autonomous exploration AppAgent reflects on its previous action — checking that the action adhered to the given task — before writing per-element documentation, so each interaction is reviewed…
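The reflect-then-document ordering in the Reflection entry can be sketched as a single gate. This is an assumption-laden illustration: `Judge` and `reflect_and_document` are hypothetical names, screens are reduced to strings, and skipping the write when the review fails is a simplification chosen for this example, not a confirmed description of how AppAgent handles a failed review.

```python
from typing import Callable, Dict, Optional

# Hypothetical reviewer signature: (task, element_id, before, after) -> note or None.
# It returns a note describing the element when the action adhered to the task,
# and None when it did not.
Judge = Callable[[str, str, str, str], Optional[str]]


def reflect_and_document(
    task: str,
    element_id: str,
    before: str,
    after: str,
    judge: Judge,
    docs: Dict[str, str],
) -> bool:
    """Review the previous action, then document the element only if it passed."""
    note = judge(task, element_id, before, after)
    if note is None:
        return False  # review failed; nothing is written (assumed behaviour)
    docs[element_id] = note
    return True
```

A usage example with a stand-in reviewer that treats an unchanged screen as a failed action:

```python
docs: Dict[str, str] = {}
judge = lambda task, el, before, after: None if before == after else f"Tapping {el} opens {after}."
reflect_and_document("open settings", "gear_icon", "home", "settings", judge, docs)
reflect_and_document("open settings", "logo", "home", "home", judge, docs)
print(docs)
```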