AppAgent
AppAgent is a multimodal agent framework that operates smartphone applications by observing the screen and performing human-like taps and swipes, after first exploring each app to build a per-element documentation base.
Description
AppAgent is an LLM-based multimodal agent framework designed to operate smartphone applications. It first runs an exploration phase, either exploring an app autonomously or learning from a human demonstration, and generates per-element documentation that it saves for later use. In the deployment phase it drives the app through a simplified action space that mimics human interactions such as tapping and swiping, using a vision-language model to read labeled screenshots. The approach bypasses the need for system back-end access, so it applies across diverse apps without per-app APIs.
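The simplified action space described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not AppAgent's actual interface: the reply format (`tap(3)`, `swipe(3, up)`), the names `Tap`, `Swipe`, `parse_action` and `to_gesture`, the fixed swipe length, and the idea that each labeled element is known by a pixel bounding box.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Tap:
    element: int  # numeric label drawn on the screenshot (hypothetical scheme)


@dataclass(frozen=True)
class Swipe:
    element: int
    direction: str  # "up", "down", "left", or "right"


# Assumed reply grammar; the real prompt/response format may differ.
_TAP = re.compile(r"^tap\((\d+)\)$")
_SWIPE = re.compile(r"^swipe\((\d+),\s*(up|down|left|right)\)$")


def parse_action(reply: str):
    """Turn a model reply such as 'tap(3)' into an action object."""
    reply = reply.strip()
    m = _TAP.match(reply)
    if m:
        return Tap(int(m.group(1)))
    m = _SWIPE.match(reply)
    if m:
        return Swipe(int(m.group(1)), m.group(2))
    raise ValueError(f"unrecognized action: {reply!r}")


def center(bounds):
    """Center of a (left, top, right, bottom) box in pixels."""
    left, top, right, bottom = bounds
    return ((left + right) // 2, (top + bottom) // 2)


def to_gesture(action, elements, swipe_px=400):
    """Resolve an action to screen coordinates using the labeled elements."""
    x, y = center(elements[action.element])
    if isinstance(action, Tap):
        return ("tap", x, y)
    # Unit vector for each swipe direction (screen y grows downward).
    dx, dy = {"up": (0, -1), "down": (0, 1),
              "left": (-1, 0), "right": (1, 0)}[action.direction]
    return ("swipe", x, y, x + dx * swipe_px, y + dy * swipe_px)


# Usage: one labeled element, addressed by the number shown on the screenshot.
elements = {3: (100, 800, 300, 1000)}
tap_gesture = to_gesture(parse_action("tap(3)"), elements)
swipe_gesture = to_gesture(parse_action("swipe(3, up)"), elements)
```

Because the model only names a label and a gesture, the same small vocabulary applies to any app whose screen can be labeled, which is what removes the need for per-app APIs.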
Solution
AppAgent works in two phases. In the exploration phase the agent either explores an app on its own or learns from a human demonstration, and saves documentation for each UI element it interacts with. In the deployment phase it reads labeled screenshots with a vision-language model, consults the saved documentation, and selects an action from a simplified action space of human-like interactions such as tapping and swiping.
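The hand-off between the two phases can be sketched as a small documentation store that exploration writes to and deployment reads from. This is a sketch under stated assumptions, not the framework's implementation: `DocBase`, `build_prompt`, the string element identifiers, the in-memory dictionary, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DocBase:
    """Per-element notes, keyed by (app, element_id). Hypothetical structure."""
    notes: dict = field(default_factory=dict)

    def record(self, app, element_id, description):
        # Exploration phase: save what interacting with this element did.
        self.notes[(app, element_id)] = description

    def lookup(self, app, element_ids):
        # Deployment phase: fetch notes only for elements on the current screen.
        return {e: self.notes[(app, e)]
                for e in element_ids if (app, e) in self.notes}


def build_prompt(task, labels, docs):
    """Build the text sent alongside the labeled screenshot.

    labels maps the numeric label drawn on the screenshot to an element id.
    """
    lines = [f"Task: {task}", "Labeled elements:"]
    for label, element_id in sorted(labels.items()):
        note = docs.get(element_id, "(no documentation yet)")
        lines.append(f"  {label}: {note}")
    lines.append("Reply with one action, e.g. tap(1).")
    return "\n".join(lines)


# Usage: document one element during exploration, then reuse it at deployment.
base = DocBase()
base.record("mail", "btn_compose", "Opens a new draft.")

on_screen = {1: "btn_compose", 2: "btn_unknown"}
found = base.lookup("mail", on_screen.values())
prompt = build_prompt("Send an email", on_screen, found)
```

Looking up only the elements visible on the current screen keeps the prompt short, and an element with no saved note is still listed so the model can act on it.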
Primary use cases
- operating smartphone apps without back-end APIs
- exploration-driven documentation of opaque app UIs
- GUI task automation on Android through taps and swipes
- vision-language control of mobile interfaces