Full-Code · Browser & Computer-Use · active

AppAgent

Type: full-code · Vendor: Tencent (TencentQQGYLab) · Language: Python · License: MIT · Status: active · Status in practice: emerging · First released: 2023-12-21


AppAgent is a multimodal agent framework that operates smartphone applications by observing the screen and performing human-like taps and swipes, after first exploring each app to build a per-element documentation base.

Description. AppAgent is an LLM-based multimodal agent framework designed to operate smartphone applications. It first runs an exploration phase, either exploring an app autonomously or learning from a human demonstration, and generates per-element documentation that it saves for later use. In the deployment phase it drives the app through a simplified action space that mimics human interactions such as tapping and swiping, using a vision-language model to read labeled screenshots. The approach bypasses the need for system back-end access, so it applies across diverse apps without per-app APIs.
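The simplified action space described above can be illustrated with a small parser that turns a model's text reply into a typed action. This is a minimal sketch, not AppAgent's own code: the `Tap` and `Swipe` classes, the `parse_action` helper, and the `tap(3)` reply syntax are all assumptions made for illustration, and only the two interactions named in the description are covered.

```python
import re
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Tap:
    element: int  # numeric label of the element to tap (hypothetical type)


@dataclass(frozen=True)
class Swipe:
    element: int    # numeric label of the element to swipe on
    direction: str  # "up", "down", "left" or "right"


Action = Union[Tap, Swipe]

# Assumed reply syntax: tap(3) or swipe(12, "up"). Not taken from AppAgent.
_TAP = re.compile(r'^tap\((\d+)\)$')
_SWIPE = re.compile(r'^swipe\((\d+),\s*"?(up|down|left|right)"?\)$')


def parse_action(reply: str) -> Action:
    """Turn a model reply such as 'tap(3)' into a typed action."""
    text = reply.strip().lower()
    match = _TAP.match(text)
    if match:
        return Tap(int(match.group(1)))
    match = _SWIPE.match(text)
    if match:
        return Swipe(int(match.group(1)), match.group(2))
    raise ValueError(f"unrecognized action: {reply!r}")
```

Keeping the action space this small means every model reply either maps to one well-defined gesture or is rejected, which makes malformed replies easy to detect and retry.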

Agent loop shape. The loop has two phases. In the exploration phase the agent either explores an app on its own or learns from a human demonstration, and it writes and saves documentation for each element it interacts with. In the deployment phase, at each step it observes a labeled screenshot with a vision-language model, selects an action from the simplified action space (for example a tap or a swipe), and performs it, consulting the documentation base built during exploration.
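The two phases can be sketched against a toy, in-memory app. Everything here is hypothetical: `ToyApp`, `explore`, `deploy`, and `keyword_policy` are illustrative names, the app is a plain dictionary rather than a real device, and a keyword matcher stands in for the vision-language model. Documentation is keyed by (screen, label) purely to keep the example short.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy app: screen name -> {numeric label: (what the element does, next screen)}.
ToyApp = Dict[str, Dict[int, Tuple[str, str]]]
Docs = Dict[Tuple[str, int], str]  # (screen, label) -> saved note
Policy = Callable[[str, str, Dict[int, str]], Optional[int]]


def explore(app: ToyApp, start: str) -> Docs:
    """Exploration phase: try every labeled element once and record its effect."""
    docs: Docs = {}
    todo, seen = [start], set()
    while todo:
        screen = todo.pop()
        if screen in seen:
            continue
        seen.add(screen)
        for label, (effect, next_screen) in app[screen].items():
            docs[(screen, label)] = effect
            todo.append(next_screen)
    return docs


def deploy(app: ToyApp, start: str, task: str, docs: Docs,
           policy: Policy, max_steps: int = 10) -> List[Tuple[str, int]]:
    """Deployment phase: observe, consult the docs, pick a label, act, repeat."""
    screen = start
    trace: List[Tuple[str, int]] = []
    for _ in range(max_steps):
        notes = {label: docs.get((screen, label), "") for label in app[screen]}
        label = policy(task, screen, notes)  # stand-in for the model's decision
        if label is None:  # the policy found nothing useful to do
            break
        trace.append((screen, label))
        screen = app[screen][label][1]
    return trace


def keyword_policy(task: str, screen: str, notes: Dict[int, str]) -> Optional[int]:
    """Pick the lowest-numbered element whose note shares a word with the task."""
    words = set(task.lower().split())
    for label, note in sorted(notes.items()):
        if words & set(note.split()):
            return label
    return None


APP: ToyApp = {
    "home": {1: ("opens settings", "settings"), 2: ("opens search", "search")},
    "settings": {1: ("toggles dark mode", "done"), 2: ("returns home", "home")},
    "search": {1: ("returns home", "home")},
    "done": {},
}

docs = explore(APP, "home")
trace = deploy(APP, "home", "turn on dark mode in settings", docs, keyword_policy)
```

The point of the split is that `deploy` never needs to discover what an element does: it only reads notes that `explore` already saved, which mirrors how the deployment phase relies on the documentation base.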

Primary use cases

operating smartphone apps without backend APIs
exploration-driven documentation of opaque app UIs
GUI task automation on Android through taps and swipes
vision-language control of mobile interfaces

flowchart TD
    fw["AppAgent"]
    fw --> p1["App Exploration Phase<br/>(core)"]
    fw --> p2["Mobile UI Agent<br/>(core)"]
    fw --> p3["Computer Use<br/>(supported)"]
    fw --> p4["Reflection<br/>(supported)"]

Key concepts

Exploration phase → app-exploration-phase (docs) — The first stage in which the agent autonomously explores an app or learns from a human demonstration and writes documentation for each UI element it interacts with, building the knowledge base used later.
Deployment phase → mobile-ui-agent (docs) — The execution stage where the agent is given a task and a chosen documentation base, then carries out the task by reading labeled screenshots and acting on them.
Numeric element labels → computer-use (docs) — Numeric tags overlaid on every interactive element in each captured screenshot, giving the vision-language model a stable handle to reference the element it wants to tap or swipe.
Element documentation base → procedural-memory (docs) — The saved per-element notes produced during exploration that the agent loads in the deployment phase, so prior experience with an app's controls carries into task execution.
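The numeric labels and the saved documentation base can be sketched together. This is an illustration under stated assumptions, not AppAgent's implementation: the helper names, the reading-order numbering, the bounding-box format, the string element keys, and the JSON file layout are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom in pixels (assumed format)


def assign_labels(boxes: List[Box]) -> Dict[int, Box]:
    """Number elements 1..N in reading order: top to bottom, then left to right."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {i: box for i, box in enumerate(ordered, start=1)}


def center(box: Box) -> Tuple[int, int]:
    """Point to tap for a labeled element: the middle of its bounding box."""
    left, top, right, bottom = box
    return (left + right) // 2, (top + bottom) // 2


def save_docs(path: Path, docs: Dict[str, str]) -> None:
    """Persist per-element notes so a later run can reuse them."""
    path.write_text(json.dumps(docs, indent=2, sort_keys=True), encoding="utf-8")


def load_docs(path: Path) -> Dict[str, str]:
    """Load saved notes, or start with an empty base if none exist yet."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


# Usage: label two elements, then round-trip a small documentation base.
labels = assign_labels([(100, 200, 300, 260), (0, 0, 50, 50)])
tap_point = center(labels[2])

with tempfile.TemporaryDirectory() as tmp:
    doc_path = Path(tmp) / "demo_app_docs.json"  # hypothetical file name
    missing = load_docs(doc_path)
    save_docs(doc_path, {"search_button": "Opens the search screen."})
    restored = load_docs(doc_path)
```

The label gives the model a short, unambiguous handle for the current screenshot, while the saved notes are what carries over between runs.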

Provenance

Last analyzed: 2026-06-17
Last updated: 2026-06-17
Verification status: partial