Framework · Browser & Computer-Use

AppAgent

AppAgent is a multimodal agent framework that operates smartphone applications by observing the screen and performing human-like taps and swipes, after first exploring each app to build a per-element documentation base.

Description

AppAgent is an LLM-based multimodal agent framework designed to operate smartphone applications. It first runs an exploration phase, either exploring an app autonomously or learning from a human demonstration, and generates per-element documentation that it saves for later use. In the deployment phase it drives the app through a simplified action space that mimics human interactions such as tapping and swiping, using a vision-language model to read labeled screenshots. The approach bypasses the need for system back-end access, so it applies across diverse apps without per-app APIs.

Solution

In the exploration phase the agent either explores an app on its own or learns from a human demonstration, and it saves documentation for each element it interacts with. In the deployment phase it reads labeled screenshots with a vision-language model, consults the saved documentation, and selects an action from a simplified action space of human-like interactions such as tapping and swiping.
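The two phases can be sketched roughly as follows. This is a minimal illustration, not AppAgent's actual code: the names `Element`, `DocBase`, `explore`, and `deploy_step` are hypothetical, the `resource_id` key is an assumed way to identify elements, and the vision-language model is replaced by plain callables (`describe`, `choose`) that the caller supplies.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Element:
    """A labeled UI element as it might appear on an annotated screenshot."""
    label: int        # numeric tag drawn over the element
    resource_id: str  # hypothetical stable key for the documentation base


@dataclass
class DocBase:
    """Per-element documentation gathered during exploration."""
    notes: Dict[str, str] = field(default_factory=dict)

    def record(self, element: Element, description: str) -> None:
        self.notes[element.resource_id] = description

    def lookup(self, element: Element) -> str:
        return self.notes.get(element.resource_id, "(no documentation)")


def explore(elements: List[Element],
            describe: Callable[[Element], str],
            docs: DocBase) -> None:
    """Exploration phase: document every element that was interacted with.

    `describe` stands in for whatever produces the description, for example
    a model comparing screenshots taken before and after the interaction.
    """
    for element in elements:
        docs.record(element, describe(element))


def deploy_step(task: str,
                elements: List[Element],
                docs: DocBase,
                choose: Callable[[str], Tuple[str, int]]) -> Tuple[str, int]:
    """Deployment phase: build a prompt from saved docs and pick one action.

    `choose` stands in for the vision-language model and returns a pair
    such as ("tap", 2), meaning "tap the element labeled 2".
    """
    lines = [f"Task: {task}"]
    for element in elements:
        lines.append(f"[{element.label}] {docs.lookup(element)}")
    return choose("\n".join(lines))
```

In use, `explore` would run once per app to fill a `DocBase`, and `deploy_step` would then be called in a loop, once per screen, until the task is finished. Keying the documentation by element rather than by screen is a design choice assumed here so that notes can be reused wherever the same element reappears.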

Primary use cases

  • operating smartphone apps without backend APIs
  • exploration-driven documentation of opaque app UIs
  • GUI task automation on Android through taps and swipes
  • vision-language control of mobile interfaces
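For the Android use case, one way to turn a chosen tap or swipe into a device input is the `input` command available through `adb shell`. The sketch below only builds the command lines. It assumes adb is the transport, which this summary does not state, and the helper names `center`, `tap_command`, and `swipe_command` are hypothetical.

```python
from typing import List, Tuple

Bounds = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
Point = Tuple[int, int]


def center(bounds: Bounds) -> Point:
    """Return the midpoint of an element's bounding box."""
    left, top, right, bottom = bounds
    return ((left + right) // 2, (top + bottom) // 2)


def tap_command(x: int, y: int) -> List[str]:
    """Build the argument list for `adb shell input tap X Y`."""
    return ["adb", "shell", "input", "tap", str(x), str(y)]


def swipe_command(start: Point, end: Point, duration_ms: int = 300) -> List[str]:
    """Build the argument list for `adb shell input swipe X1 Y1 X2 Y2 MS`."""
    (x1, y1), (x2, y2) = start, end
    return ["adb", "shell", "input", "swipe",
            str(x1), str(y1), str(x2), str(y2), str(duration_ms)]
```

Each list can be passed to `subprocess.run(..., check=True)` on a machine with a connected device. Returning argument lists instead of running them keeps the mapping from action to command easy to inspect and test without hardware.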
