III · Tool Use & Environment · Experimental

Affordance Grounding Before Action

also known as Affordance Prompting, Feasibility Screen, Affordance Gate

Have a vision-language model ground each candidate action against the current scene and predict its affordance, so that actions the environment cannot physically support are discarded before any reach the controller.

Context

An embodied agent — a robot arm, a mobile manipulator, a GUI or device controller — plans an action from a high-level goal and a view of the scene. The planner reasons in language about what to do next, but language plans drift from what the body and the scene actually allow. A target may sit out of reach, an object may be too large for the gripper, a surface may not be graspable, or a referenced widget may not exist on screen. Executing such an action wastes a real interaction step and can leave the world in a worse state.

Problem

A language planner proposes actions from intent, not from what the scene affords, so it readily emits commands the agent cannot carry out: grasp an object beyond reach, place on a surface that does not exist, click a control that is off screen. Checking feasibility only after execution is slow and sometimes destructive, while encoding every physical pre-condition by hand is brittle across scenes and embodiments. The agent needs to know, from the current perception, whether each proposed action is even possible before it spends a real step on it.

Forces

  • A language planner reasons about goals and steps but has weak grounding in the geometry, reachability, and physics of the specific scene in front of it.
  • Validating an action by executing it costs a real interaction step and can be irreversible, so failed actions are expensive.
  • Hand-coding pre-conditions per object and per embodiment does not transfer; a learned visual predictor generalises but adds latency and can mis-score.

Example

A tabletop robot is told to put the mug on the top shelf. Its planner proposes grasping the mug and placing it on the shelf. Before either runs, a vision-language model looks at the camera image and scores each action: the grasp is feasible because the mug is in reach and gripper-sized, but the place is scored infeasible because the top shelf is above the arm's reach. The place action is filtered out, and the planner is asked to revise toward a reachable shelf instead of wasting a move the arm could never complete.
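The mug scenario can be traced in a few lines. This is only an illustrative sketch: the scores are canned values standing in for a vision-language model's output, and the `middle_shelf` target, the `revise` helper, and the 0.5 threshold are assumptions made for the example, not part of the pattern.

```python
from typing import List

# Canned scores standing in for a vision-language model's reading of the
# camera image; the numbers are invented for illustration.
CANNED_SCORES = {
    "grasp(mug)": 0.92,                # mug is in reach and gripper-sized
    "place(mug, top_shelf)": 0.08,     # top shelf is above the arm's reach
    "place(mug, middle_shelf)": 0.85,  # hypothetical reachable alternative
}
THRESHOLD = 0.5  # assumed feasibility cut-off


def score(action: str) -> float:
    """Stand-in for the affordance query; unknown actions score 0."""
    return CANNED_SCORES.get(action, 0.0)


def revise(action: str) -> str:
    """Toy planner revision: retarget the unreachable shelf to a lower one."""
    return action.replace("top_shelf", "middle_shelf")


def screen(plan: List[str], max_revisions: int = 1) -> List[str]:
    """Return the actions cleared for the controller, revising failures."""
    cleared = []
    for action in plan:
        attempts = 0
        # Ask the planner to revise while the action is scored infeasible.
        while score(action) < THRESHOLD and attempts < max_revisions:
            action = revise(action)
            attempts += 1
        # Only actions at or above the threshold are forwarded.
        if score(action) >= THRESHOLD:
            cleared.append(action)
    return cleared


print(screen(["grasp(mug)", "place(mug, top_shelf)"]))
# prints ['grasp(mug)', 'place(mug, middle_shelf)']
```

With `max_revisions=0` the infeasible place action is simply dropped, which corresponds to filtering without asking the planner for an alternative.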

Solution

Therefore:

For each candidate action the planner proposes, send a vision-language model a grounded query that pairs the action with the current scene image and asks whether the agent's body can perform it here: is the target reachable, graspable, clickable, large enough, on a valid surface? The model returns an affordance score or a yes/no feasibility judgement, optionally with the grounded location. Candidates that score below a chosen feasibility threshold, or that are judged infeasible, are filtered out and the planner is asked to revise; candidates that pass are forwarded to the low-level controller for execution. The check is pure perception: it reads the scene as it is and predicts feasibility, without rolling out the action's downstream consequences or maintaining a simulator of the environment.
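A minimal sketch of the query-and-filter step might look like the following. Everything named here is an assumption for illustration: the `VLMClient` callable signature, the prompt wording in `render_query`, the `Verdict` record, and the default threshold of 0.5. A real system would substitute its own model client, and `fake_vlm` is only a stub so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Location = Tuple[int, int]  # assumed: a grounded pixel coordinate


@dataclass(frozen=True)
class Verdict:
    action: str
    score: float                         # affordance score in [0, 1]
    location: Optional[Location] = None  # optional grounded location


# Hypothetical model interface: (prompt, image bytes) -> (score, location).
VLMClient = Callable[[str, bytes], Tuple[float, Optional[Location]]]


def render_query(action: str, embodiment: str) -> str:
    """Pair the candidate action with a feasibility question about the scene."""
    return (
        f"You control a {embodiment}. Looking at the attached image, can it "
        f"perform the action '{action}' right now? Consider whether the target "
        "is reachable, graspable or clickable, large enough, and on a valid surface."
    )


def gate(
    candidates: Sequence[str],
    image: bytes,
    vlm: VLMClient,
    embodiment: str,
    threshold: float = 0.5,
) -> Tuple[List[Verdict], List[Verdict]]:
    """Split candidates into (passed, rejected) using perception only."""
    passed: List[Verdict] = []
    rejected: List[Verdict] = []
    for action in candidates:
        score, location = vlm(render_query(action, embodiment), image)
        verdict = Verdict(action, score, location)
        (passed if score >= threshold else rejected).append(verdict)
    return passed, rejected


def fake_vlm(prompt: str, image: bytes) -> Tuple[float, Optional[Location]]:
    """Stub for demonstration only: treats off-screen targets as infeasible."""
    if "off_screen" in prompt:
        return 0.1, None
    return 0.9, (120, 48)


passed, rejected = gate(
    ["click(save_button)", "click(off_screen_button)"],
    image=b"",  # placeholder for the current screenshot or camera frame
    vlm=fake_vlm,
    embodiment="GUI controller",
)
print([v.action for v in passed])    # forwarded to the low-level controller
print([v.action for v in rejected])  # returned to the planner for revision
```

Note that `gate` never simulates the action's outcome; it only asks about the scene as it currently appears, which is what separates this check from the simulation-based neighbours listed further down.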

What this pattern forbids. The agent may not execute a candidate action until the vision-language affordance check has scored it at or above the feasibility threshold. A candidate scored below the threshold must be filtered or revised, never attempted, so that only actions the scene affords reach the controller.
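One way to make this prohibition hard to bypass is to put the controller behind a wrapper that refuses any action without a passing check on record. The sketch below is an assumption about how that could be enforced, not a prescribed implementation: `GatedController`, `record_check`, and `AffordanceCheckRequired` are hypothetical names, and treating each clearance as single-use is a design choice made here because the check describes the scene only as it currently stands.

```python
from typing import Callable, Set


class AffordanceCheckRequired(RuntimeError):
    """Raised when an action is run without a passing affordance check."""


class GatedController:
    """Wraps a low-level executor so unchecked actions cannot reach it."""

    def __init__(self, execute: Callable[[str], object], threshold: float = 0.5):
        self._execute = execute        # the real low-level controller call
        self._threshold = threshold    # assumed feasibility cut-off
        self._cleared: Set[str] = set()

    def record_check(self, action: str, score: float) -> bool:
        """Store the outcome of an affordance check; return True if it passed."""
        if score >= self._threshold:
            self._cleared.add(action)
            return True
        self._cleared.discard(action)
        return False

    def run(self, action: str) -> object:
        """Execute only if the action currently holds a passing check."""
        if action not in self._cleared:
            raise AffordanceCheckRequired(f"no passing affordance check for {action!r}")
        # Clearance is single-use: the scene may change after execution.
        self._cleared.remove(action)
        return self._execute(action)


executed = []
controller = GatedController(execute=executed.append)

controller.record_check("grasp(mug)", 0.92)
controller.run("grasp(mug)")  # allowed: check passed

controller.record_check("place(mug, top_shelf)", 0.08)
try:
    controller.run("place(mug, top_shelf)")  # blocked: scored below threshold
except AffordanceCheckRequired as err:
    print("blocked:", err)

print(executed)  # prints ['grasp(mug)']
```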

And the patterns that stand alongside it, or against it —

  • complements · Simulate Before Actuate · Before issuing an irreversible action, run a deterministic simulation that computes pre-conditions, invariants, and expected deltas; require a verifier, automated or human, to green-light the simulated outcome before the real command is sent.
  • alternative-to · World Model as Tool · Let a planning agent invoke a generative world model as a tool to roll out hypothetical futures before committing to an action, treating the world model as a callable simulator rather than a training target.
  • alternative-to · Mental-Model-In-The-Loop Simulator · Run candidate multi-step strategies inside an internal simulator of the environment before committing in the real world; broader than simulate-before-actuate (single action) by simulating multi-step strategies.
  • complements · Canonical-Entity Grounding · Require the agent to resolve every business identifier it uses (SKU, account, supplier, customer) through an authoritative lookup against the system of record, rather than emitting the identifier from the model's parametric memory.
