Affordance Grounding Before Action
Have a vision-language model ground each candidate action against the current scene and predict its affordance, so that actions the environment cannot physically support are discarded before any reach the controller.
Problem
A language planner proposes actions from intent, not from what the scene affords, so it readily emits commands the agent cannot carry out: grasp an object beyond reach, place on a surface that does not exist, click a control that is off screen. Checking feasibility only after execution is slow and sometimes destructive, while encoding every physical pre-condition by hand is brittle across scenes and embodiments. The agent needs to know, from the current perception, whether each proposed action is even possible before it spends a real step on it.
Solution
For each candidate action the planner proposes, send a grounded query to a vision-language model that pairs the action with the current scene image and asks whether the agent's body can perform it here: is the target reachable, graspable, clickable, large enough, on a valid surface? The model returns an affordance score or a yes/no feasibility judgement, optionally with the grounded location. Candidates that score below a chosen threshold (or are judged infeasible) are filtered out and the planner is asked to revise; candidates that pass are forwarded to the low-level controller for execution. The check is pure perception: it reads the scene as it is and predicts feasibility, without rolling out the action's downstream consequences or maintaining a simulator of the environment.
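The filtering step can be sketched as follows. This is a minimal illustration, not a reference implementation: `Affordance`, `filter_by_affordance`, `stub_scorer`, and the 0.5 threshold are all hypothetical names and values chosen for the example, and the stub stands in for a real vision-language model call, which would receive the action text together with the scene image.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Affordance:
    """Hypothetical result of one grounded query."""
    score: float                               # predicted feasibility in [0, 1]
    location: Optional[Tuple[int, int]] = None  # optional grounded pixel location


# Assumed scorer signature: (action text, scene image bytes) -> Affordance.
Scorer = Callable[[str, bytes], Affordance]


def filter_by_affordance(
    candidates: Sequence[str],
    scene_image: bytes,
    scorer: Scorer,
    threshold: float = 0.5,  # illustrative value, tune per task
) -> Tuple[List[Tuple[str, Affordance]], List[str]]:
    """Split candidates into those the scene affords and those it does not."""
    feasible: List[Tuple[str, Affordance]] = []
    rejected: List[str] = []
    for action in candidates:
        result = scorer(action, scene_image)
        if result.score >= threshold:
            feasible.append((action, result))  # forward to the controller
        else:
            rejected.append(action)            # hand back to the planner to revise
    return feasible, rejected


def stub_scorer(action: str, scene_image: bytes) -> Affordance:
    """Stand-in for a vision-language model; ignores the image entirely."""
    known = {"grasp mug": Affordance(0.9, (120, 80))}
    return known.get(action, Affordance(0.1))


feasible, rejected = filter_by_affordance(
    ["grasp mug", "place on shelf"], b"", stub_scorer
)
```

Note that the function only reads the current scene and scores each candidate; it performs no rollout and keeps no environment state, which matches the pure-perception framing. The rejected list is returned rather than dropped so the planner can be told which proposals to revise.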
When to use
- An embodied or device agent plans actions in language but executes against a physical or on-screen scene with reach, geometry, or existence constraints.
- Failed actions are expensive or irreversible, so screening out infeasible candidates from perception is worth the cost of an extra model inference per step.
- A vision-language model can ground the action against the scene and predict feasibility better than hand-coded pre-conditions.