III · Tool Use & Environment · Emerging

Browser Agent

also known as Web Agent, Browser Automation Agent

Expose websites to the agent through a structured DOM/accessibility tree plus a small action vocabulary, sitting between raw HTML and pixel-level Computer Use.

This pattern helps complete certain larger patterns —

  • specialises Tool Use ★★: Let the LLM produce typed calls against an external toolkit instead of producing free-form text the surrounding system has to parse.

Context

A team needs an agent that operates websites end-to-end: filling forms, pulling competitive data, navigating multi-page checkouts, or running research across many sites. The target sites have no clean API the team can integrate with, and pixel-level screen control (the Computer Use approach) is too slow and brittle for routine web work.

Problem

Raw HTML is full of inline scripts, tracking pixels, and minified CSS that overwhelm the context window before the agent reaches the actual content. Treating the browser as pure pixels and driving the mouse to coordinates is slow, breaks the moment the layout shifts, and burns vision tokens on every click. Without a stable, structured representation of the page the agent ends up reasoning over noise instead of intent.

Forces

  • DOM extraction must yield a representation that stays stable across sites.
  • The action vocabulary trades completeness against simplicity.
  • Anti-bot measures can break agent flows.

Example

A growth team builds an agent that scrapes competitor pricing pages. Feeding raw HTML overflows context with tracking scripts and inline CSS; pixel-level Computer Use is overkill for clicking through five filters. They settle on a Browser Agent surface: the page is reduced to a structured DOM/accessibility tree of interactable elements, and the agent emits actions from a small vocabulary like click(id) and type(id, text). The model spends its tokens on intent, not on parsing minified script tags.
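
The reduction step in this example can be sketched with the Python standard library alone. This is a minimal illustration, not the extraction logic of any particular framework: the class `InteractiveElementExtractor`, the function `render_page_state`, the choice of interactive tags, and the `[n] <tag> label` output format are all assumptions made for the sketch. A production surface would more likely read the browser's accessibility tree than re-parse HTML.

```python
from html.parser import HTMLParser

# Assumption: these tags are what the sketch counts as "interactable".
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
# Script and style bodies are noise for the agent and are dropped entirely.
SKIPPED_TAGS = {"script", "style"}


class InteractiveElementExtractor(HTMLParser):
    """Collects interactive elements and assigns each a small integer id."""

    def __init__(self):
        super().__init__()
        self.elements = []      # one dict per interactive element, in page order
        self._open = None       # <a>/<button> whose text is still being read
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth or tag not in INTERACTIVE_TAGS:
            return
        attr = dict(attrs)
        label = attr.get("aria-label") or attr.get("placeholder") or attr.get("name") or ""
        element = {"id": len(self.elements) + 1, "tag": tag, "label": label}
        self.elements.append(element)
        if tag in ("a", "button"):
            # Links and buttons usually carry their label as inner text.
            self._open = element

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif self._open is not None and tag == self._open["tag"]:
            self._open = None

    def handle_data(self, data):
        if self._skip_depth == 0 and self._open is not None and not self._open["label"]:
            self._open["label"] = data.strip()


def render_page_state(html: str) -> str:
    """Return one line per interactive element, e.g. '[2] <button> Apply filters'."""
    parser = InteractiveElementExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(
        f"[{e['id']}] <{e['tag']}> {e['label']}".rstrip() for e in parser.elements
    )


SAMPLE_PAGE = """
<html><head>
<script>window.track("pageview");</script>
<style>.price { color: red; }</style>
</head><body>
<input name="max_price" placeholder="Max price">
<button>Apply filters</button>
<a href="/plans">See all plans</a>
</body></html>
"""

if __name__ == "__main__":
    print(render_page_state(SAMPLE_PAGE))
```

The tracking script and stylesheet never reach the model; only three short lines do, and the ids in those lines are what actions such as click(id) refer to.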

Solution

Therefore:

A library (Playwright-backed) exposes structured page state (numbered interactive elements, accessibility tree) and a compact action set (click, type, scroll, navigate). The agent reasons over the structured state and emits actions; the library executes them.
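
One way to picture the split between "agent emits actions" and "library executes them" is a set of typed action records plus a dispatcher. The sketch below is an assumption-laden illustration: the dataclasses, the `BrowserBackend` protocol, and the `RecordingBackend` stand-in are hypothetical names, not the API of Playwright or of any browser-agent library. In a real deployment the backend methods would wrap the underlying browser driver; here they only record calls so the example runs without a browser.

```python
from dataclasses import dataclass
from typing import Protocol, Union


# The compact action set from the Solution text, one frozen record per action.
@dataclass(frozen=True)
class Click:
    element_id: int      # id of a numbered interactive element


@dataclass(frozen=True)
class Type:
    element_id: int
    text: str


@dataclass(frozen=True)
class Scroll:
    delta_y: int         # assumption: vertical scroll distance in pixels


@dataclass(frozen=True)
class Navigate:
    url: str


Action = Union[Click, Type, Scroll, Navigate]


class BrowserBackend(Protocol):
    """Hypothetical interface the executing library would implement."""

    def click(self, element_id: int) -> None: ...
    def type_text(self, element_id: int, text: str) -> None: ...
    def scroll(self, delta_y: int) -> None: ...
    def navigate(self, url: str) -> None: ...


class RecordingBackend:
    """Stand-in backend that logs calls instead of driving a browser."""

    def __init__(self) -> None:
        self.log: list[tuple] = []

    def click(self, element_id: int) -> None:
        self.log.append(("click", element_id))

    def type_text(self, element_id: int, text: str) -> None:
        self.log.append(("type", element_id, text))

    def scroll(self, delta_y: int) -> None:
        self.log.append(("scroll", delta_y))

    def navigate(self, url: str) -> None:
        self.log.append(("navigate", url))


def execute(action: Action, backend: BrowserBackend) -> None:
    """Dispatch one typed action to the backend; anything else is rejected."""
    if isinstance(action, Click):
        backend.click(action.element_id)
    elif isinstance(action, Type):
        backend.type_text(action.element_id, action.text)
    elif isinstance(action, Scroll):
        backend.scroll(action.delta_y)
    elif isinstance(action, Navigate):
        backend.navigate(action.url)
    else:
        raise TypeError(f"not part of the action vocabulary: {action!r}")


if __name__ == "__main__":
    backend = RecordingBackend()
    plan = [Navigate("https://example.com/pricing"), Type(1, "50"), Click(2)]
    for step in plan:
        execute(step, backend)
    print(backend.log)
```

Keeping the actions as plain data makes each step easy to log, replay, and validate before anything touches the page.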

What this pattern forbids. Actions are limited to the typed vocabulary; arbitrary JavaScript execution is not part of this surface.
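
A hedged sketch of how that restriction might be enforced at the boundary: the model's raw output is parsed and checked against an allow-list before anything is executed. The JSON wire format, the `ALLOWED_ACTIONS` table, the `ForbiddenAction` exception, and `parse_action` are all assumptions for illustration; the source does not specify how actions are serialised.

```python
import json

# Assumption: each action arrives as a JSON object with an "action" name
# plus exactly the arguments listed here.
ALLOWED_ACTIONS = {
    "click": {"id": int},
    "type": {"id": int, "text": str},
    "scroll": {"delta_y": int},
    "navigate": {"url": str},
}


class ForbiddenAction(ValueError):
    """Raised when model output falls outside the typed vocabulary."""


def parse_action(raw: str) -> dict:
    """Validate one model-emitted action; raise ForbiddenAction otherwise."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ForbiddenAction("output is not valid JSON") from exc
    if not isinstance(payload, dict):
        raise ForbiddenAction("action must be a JSON object")

    name = payload.get("action")
    if not isinstance(name, str) or name not in ALLOWED_ACTIONS:
        # Covers anything like "evaluate" or "run_js": not in the vocabulary.
        raise ForbiddenAction(f"unknown action: {name!r}")

    schema = ALLOWED_ACTIONS[name]
    args = {key: value for key, value in payload.items() if key != "action"}
    if set(args) != set(schema):
        raise ForbiddenAction(f"{name} expects exactly {sorted(schema)}")
    for key, expected_type in schema.items():
        value = args[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ForbiddenAction(f"{name}.{key} must be {expected_type.__name__}")
    return {"action": name, **args}


if __name__ == "__main__":
    print(parse_action('{"action": "click", "id": 3}'))
    try:
        parse_action('{"action": "evaluate", "script": "document.cookie"}')
    except ForbiddenAction as err:
        print("rejected:", err)
```

Because the check is an allow-list rather than a block-list, a new capability only becomes reachable when someone deliberately adds it to the table.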

The smaller patterns that complete this one —

  • generalises Dual-System GUI Agent: Split a GUI agent into a decision model that plans and recovers from errors and a grounding model that observes pixels and emits the precise action; route each subproblem to the better-suited model.
  • generalises Policy-Localizer-Validator: Split a GUI agent into three specialist models (a Policy that plans, a Localizer that grounds elements to pixels, and a Validator that judges completion) so each role uses the smallest sufficient model.

And the patterns that stand alongside it, or against it —

  • alternative-to Computer Use: Let the model drive a desktop end-to-end via screenshots plus virtual mouse/keyboard tool calls instead of bespoke per-app APIs.
  • complements Tool Output Poisoning Defense: Treat tool output as untrusted content and apply instruction-stripping plus per-tool trust labels.
  • alternative-to Mobile UI Agent: Drive a smartphone end-to-end through a small, touch-native action vocabulary (tap, long-press, swipe, type, back, home) over screenshots, as a distinct interaction surface from desktop Computer Use and from web Browser Agents.
  • complements Magentic-One Generalist Multi-Agent: Use Microsoft's generalist multi-agent architecture: a single Orchestrator agent dispatches to four specialist sub-agents (WebSurfer, FileSurfer, Coder, ComputerTerminal) for solving open-ended complex tasks that span web browsing, file manipulation, code execution and shell operations.
  • complements Crawler Dispatcher ★★: Route each incoming URL to a domain-specific crawler through a central dispatcher mapping URL patterns to registered crawler classes.
