III · Tool Use & Environment · Emerging

Policy-Localizer-Validator

also known as Three-Way GUI Agent, Surfer-H Architecture, Validator-Gated Browser Agent

Split a GUI agent into three specialist models — a Policy that plans, a Localizer that grounds elements to pixels, and a Validator that judges completion — so each role uses the smallest sufficient model.

This pattern helps complete certain larger patterns —

  • specialises Dual-System GUI Agent: Split a GUI agent into a decision model that plans and recovers from errors and a grounding model that observes pixels and emits the precise action; route each subproblem to the better-suited model.
  • specialises Browser Agent: Expose websites to the agent through a structured DOM/accessibility tree plus a small action vocabulary, sitting between raw HTML and pixel-level Computer Use.
  • specialises Computer Use: Let the model drive a desktop end-to-end via screenshots plus virtual mouse/keyboard tool calls instead of bespoke per-app APIs.

Context

A team is operating a browser or desktop agent that reads screenshots and emits clicks, types, and scrolls. Trajectories are long, costs compound at each step, and per-step latency matters for real-time web use. The team wants to attribute failures cleanly and to size each capability with the smallest sufficient model.

Problem

One large multimodal model that plans, grounds clicks to pixels, and decides when to stop pays the largest-model price on every step, including the steps where it is really just doing perception. Failures cannot be attributed cleanly: a wrong click could be a bad plan, bad pixel grounding, or a premature stop. A two-model split that separates planning from grounding (the Dual-System approach) helps with the first two but still leaves the commit decision implicit in whatever the planner happened to say last, with no independent check that the task actually finished.

Forces

  • Planning, grounding, and completion-judgment have different optimal model sizes.
  • Pixel-precise grounding is a perception problem; large reasoning models overpay for it.
  • Completion judgment must be uncorrelated with the planner or it just rubber-stamps its own work.
  • Costs compound per step in long browser trajectories.
  • Latency on every action matters for real-time web use, so each role must be independently latency-tuned.

Example

A booking agent must reserve a meeting room on an internal portal. The Policy reads the screenshot and says 'click the Book button next to the 10 AM slot'. The Localizer VLM, trained on UI grounding, returns coordinates (892, 437), and the click is executed. The Validator then sees a confirmation modal and judges 'task complete, confidence 0.92'. On a run where grounding misfires and the Localizer returns the coordinates of the 11 AM Book button instead, the Validator sees a confirmation for the wrong slot and signals 'failed, retry'; the loop continues with corrected context.

Solution

Therefore:

Pipeline each step through three models. Policy LLM reads the current screenshot plus task state and emits a textual action ("click the Sign In button in the top-right"). Localizer VLM, trained specifically for UI grounding, takes that description plus the screenshot and returns pixel coordinates. The action is executed. Validator VLM — separately trained on completion judgments — inspects the resulting screenshot and answers "task complete?" with calibrated confidence; if uncertain, the loop continues; if confident-complete, the agent halts; if confident-failed, the agent retries or escalates. Each model can be sized independently — typically Policy is the largest, Localizer is a small specialist VLM, Validator is mid-sized.
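The step loop described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the role signatures (`Policy`, `Localizer`, `Validator`, `Executor`), the `Judgment` type, the 0.8 confidence threshold, and the 20-step budget are all assumptions introduced here, and the stub roles at the bottom stand in for real models using the values from the booking example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Verdict(Enum):
    COMPLETE = "complete"
    FAILED = "failed"
    UNCERTAIN = "uncertain"


@dataclass
class Judgment:
    verdict: Verdict
    confidence: float  # assumed calibrated, in [0, 1]


# Hypothetical role signatures; real models would sit behind these callables.
Policy = Callable[[bytes, str, List[str]], str]      # screenshot, task, history -> action text
Localizer = Callable[[bytes, str], Tuple[int, int]]  # screenshot, action text -> (x, y)
Validator = Callable[[bytes, str], Judgment]         # screenshot, task -> judgment
Executor = Callable[[Tuple[int, int]], bytes]        # click at (x, y) -> new screenshot


def run_task(task: str, screenshot: bytes, policy: Policy, localizer: Localizer,
             validator: Validator, execute: Executor,
             max_steps: int = 20, threshold: float = 0.8) -> Tuple[str, List[str]]:
    """Run Policy -> Localizer -> execute -> Validator until the Validator halts."""
    history: List[str] = []
    for _ in range(max_steps):
        action = policy(screenshot, task, history)   # textual action only
        xy = localizer(screenshot, action)           # grounding to pixels
        screenshot = execute(xy)
        history.append(action)
        judgment = validator(screenshot, task)       # independent completion check
        if judgment.confidence >= threshold:
            if judgment.verdict is Verdict.COMPLETE:
                return "complete", history
            if judgment.verdict is Verdict.FAILED:
                # Feed the failure back so the Policy can plan a retry.
                history.append("validator: last action failed, retry")
        # Uncertain or low-confidence judgments fall through and the loop continues.
    return "escalate", history  # step budget exhausted without a confident halt


# Stub roles mirroring the booking example (all values are illustrative).
clicks: List[Tuple[int, int]] = []

def fake_policy(shot: bytes, task: str, history: List[str]) -> str:
    return "click the Book button next to the 10 AM slot"

def fake_localizer(shot: bytes, action: str) -> Tuple[int, int]:
    return (892, 437)

def fake_execute(xy: Tuple[int, int]) -> bytes:
    clicks.append(xy)
    return b"screenshot-after-click"

def fake_validator(shot: bytes, task: str) -> Judgment:
    return Judgment(Verdict.COMPLETE, 0.92)

status, history = run_task("reserve the 10 AM room", b"initial-screenshot",
                           fake_policy, fake_localizer, fake_validator, fake_execute)
print(status, history)
```

In this sketch a confident failure is handled as a retry inside the same loop and escalation happens only when the step budget runs out; a real deployment might escalate earlier, for example after a fixed number of consecutive failures.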

What this pattern forbids. The Policy model must not emit pixel coordinates directly — grounding is the Localizer's exclusive responsibility. The agent must not commit to task-complete based on the Policy model's own output; only the Validator can stop the loop.
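One way to make these two prohibitions checkable at runtime is a pair of small guards in the orchestrator. The sketch below is an assumption-laden illustration: the regular expression is only a heuristic for coordinate-like text, and `RoleViolation`, `check_policy_output`, `may_halt`, and the 0.8 threshold are hypothetical names and values introduced here, not part of the pattern as described.

```python
import re

# Heuristic for coordinate-like text such as "(892, 437)" or "x=892, y=437".
# It will miss other formats; treat it as a tripwire, not a proof.
_COORD = re.compile(
    r"\(\s*\d+\s*,\s*\d+\s*\)|\bx\s*[=:]\s*\d+\s*,?\s*y\s*[=:]\s*\d+",
    re.IGNORECASE,
)


class RoleViolation(ValueError):
    """Raised when a role steps outside its responsibility."""


def check_policy_output(action_text: str) -> str:
    """Reject Policy output that appears to contain pixel coordinates."""
    if _COORD.search(action_text):
        raise RoleViolation("Policy emitted coordinates; grounding belongs to the Localizer")
    return action_text


def may_halt(source: str, verdict: str, confidence: float, threshold: float = 0.8) -> bool:
    """Only a confident 'complete' from the Validator may stop the loop."""
    return source == "validator" and verdict == "complete" and confidence >= threshold


print(check_policy_output("click the Sign In button in the top-right"))
print(may_halt("policy", "complete", 0.99))     # Policy claims are ignored
print(may_halt("validator", "complete", 0.92))
```

Calling `check_policy_output` on every Policy response before it reaches the Localizer, and routing every stop decision through `may_halt`, keeps the division of responsibility explicit in code rather than relying on prompt instructions alone.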

And the patterns that stand alongside it, or against it —

  • alternative-to Evaluator-Optimizer ★★: One LLM generates; another evaluates and feeds back; loop until criteria are met.
  • alternative-to Tool-Augmented Self-Correction: Self-correct LLM outputs by interactively critiquing them with external tools (search, code execution, calculator).
