Tool Use & Environment

Policy-Localizer-Validator

Split a GUI agent into three specialist models — a Policy that plans, a Localizer that grounds elements to pixels, and a Validator that judges completion — so each role uses the smallest sufficient model.

Problem

One large multimodal model that plans, grounds clicks to pixels, and decides when to stop pays the largest-model price on every step, including the steps where it is really just doing perception. Failures cannot be attributed cleanly: a wrong click could be a bad plan, bad pixel grounding, or a premature stop. A two-model split that separates planning from grounding (the Dual-System approach) helps with the first two but still leaves the commit decision implicit in whatever the planner happened to say last, with no independent check that the task actually finished.

Solution

Route each step through three models in sequence. The Policy LLM reads the current screenshot plus task state and emits a textual action ("click the Sign In button in the top-right"). The Localizer VLM, trained specifically for UI grounding, takes that description plus the screenshot and returns pixel coordinates, and the action is executed at those coordinates. The Validator VLM, trained separately on completion judgments, inspects the resulting screenshot and answers "task complete?" with calibrated confidence: if it is uncertain, the loop continues; if it is confident the task is complete, the agent halts; if it is confident the task has failed, the agent retries or escalates. Each model can be sized independently; typically the Policy is the largest, the Localizer is a small specialist VLM, and the Validator is mid-sized.
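
The loop described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the names run_task, classify, Judgment and Verdict are hypothetical, the three models are passed in as plain callables rather than real model clients, and the 0.9 confidence threshold and retry budget are assumed values. Representing the Validator's output as two separate probabilities (complete, failed) is also an assumption made for the sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Verdict(Enum):
    COMPLETE = "complete"
    FAILED = "failed"
    UNCERTAIN = "uncertain"


@dataclass
class Judgment:
    """Hypothetical Validator output: two calibrated probabilities."""
    p_complete: float  # confidence that the task is finished
    p_failed: float    # confidence that the task has failed


def classify(j: Judgment, threshold: float = 0.9) -> Verdict:
    """Map Validator confidences to one of the three loop outcomes."""
    if j.p_complete >= threshold:
        return Verdict.COMPLETE
    if j.p_failed >= threshold:
        return Verdict.FAILED
    return Verdict.UNCERTAIN


Screenshot = bytes
Policy = Callable[[Screenshot, List[str]], str]            # -> textual action
Localizer = Callable[[Screenshot, str], Tuple[int, int]]   # -> pixel (x, y)
Validator = Callable[[Screenshot, str], Judgment]
Executor = Callable[[Tuple[int, int]], Screenshot]         # acts, returns new screenshot


def run_task(
    task: str,
    screenshot: Screenshot,
    policy: Policy,
    localizer: Localizer,
    validator: Validator,
    execute: Executor,
    max_steps: int = 20,
    max_retries: int = 2,
) -> Tuple[str, List[str]]:
    """Run the Policy -> Localizer -> execute -> Validator loop."""
    history: List[str] = []
    retries = 0
    for _ in range(max_steps):
        # Task state is modelled here as the task plus prior actions (assumption).
        action = policy(screenshot, [task] + history)
        xy = localizer(screenshot, action)
        screenshot = execute(xy)
        history.append(action)

        verdict = classify(validator(screenshot, task))
        if verdict is Verdict.COMPLETE:
            return "complete", history
        if verdict is Verdict.FAILED:
            retries += 1
            if retries > max_retries:
                return "escalated", history
        # UNCERTAIN, or FAILED with retries left: continue the loop.
    return "step_limit", history
```

Because each role is a separate callable, a failure can be attributed by logging the three outputs per step: the Policy's action text, the Localizer's coordinates, and the Validator's judgment.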

When to use

  • Agent drives a GUI or browser via screenshots and actions.
  • Trajectories are long enough that per-step cost matters.
  • Failure-mode attribution is needed for debugging or audit.
  • Open-weights specialist VLMs are available or trainable for the target domain.

Related