Planning & Control Flow

Proactive Goal Creator

Anticipate the user's goal by capturing surrounding multimodal context (gestures, screen state, environment) in addition to what the user types or says.

Problem

If the agent listens only to the user's typed or spoken prompt, it misses the gesture pointing at an object, the screen state the user is looking at, and the ambient activity the user assumes is obvious. The user is then forced either to over-articulate (typing out what they are already pointing at) or to accept wrong answers. Naively piping raw sensor streams into the planner does not help, because it overwhelms downstream components with multimodal data they cannot use directly. The team needs a component that captures the relevant non-verbal context and synthesises it into a structured goal before planning begins.

Solution

A proactive goal creator runs alongside the dialogue interface. It activates context-capture devices (cameras for gestures, screen recorders for UI state, microphones for ambient audio, environment sensors), passes the multimodal data through context engineering, and combines the result with the user's articulated prompt to produce a refined goal. The component must notify users whenever context is being captured, and it must do so with a low false-positive rate, so that capture never comes as a surprise and users are not interrupted needlessly.
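The flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names `ContextSignal`, `RefinedGoal`, `create_goal` and the `min_confidence` threshold are hypothetical, and the sketch assumes that an upstream context-engineering step has already turned raw sensor data into short text descriptions with a confidence score. Filtering on that score is one possible way to keep uncertain observations away from the planner; the source does not prescribe it.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ContextSignal:
    """One interpreted observation from a capture device (hypothetical schema)."""
    source: str        # e.g. "camera", "screen", "microphone"
    description: str   # text summary produced by the context-engineering step
    confidence: float  # 0.0 to 1.0, as reported by the upstream interpreter


@dataclass
class RefinedGoal:
    """The user's prompt plus the non-verbal context judged reliable enough to use."""
    prompt: str
    context: List[str] = field(default_factory=list)

    def as_text(self) -> str:
        if not self.context:
            return self.prompt
        return f"{self.prompt} [context: {'; '.join(self.context)}]"


def create_goal(
    prompt: str,
    signals: List[ContextSignal],
    notify: Callable[[str], None],
    min_confidence: float = 0.7,  # assumed threshold, tune per deployment
) -> RefinedGoal:
    # Disclose every device that captured something, even if its signal is later dropped.
    sources = sorted({s.source for s in signals})
    if sources:
        notify(f"Capturing context from: {', '.join(sources)}")

    # Keep only signals the interpreter is reasonably sure about, so the
    # planner receives structured context rather than guesses.
    kept = [s for s in signals if s.confidence >= min_confidence]
    return RefinedGoal(
        prompt=prompt,
        context=[f"{s.source}: {s.description}" for s in kept],
    )


notices: List[str] = []
goal = create_goal(
    "What is this?",
    [
        ContextSignal("camera", "user points at a red mug", 0.92),
        ContextSignal("microphone", "possible kettle noise", 0.30),
    ],
    notify=notices.append,
)
print(goal.as_text())
# prints: What is this? [context: camera: user points at a red mug]
```

In this sketch the planner would receive `goal.as_text()` instead of the bare prompt, and `notices` stands in for whatever user-facing disclosure channel a real system would use.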

When to use

  • Embodied / ambient interaction is the primary surface, not chat.
  • Accessibility needs make dialogue-only interaction insufficient.
  • Context-capture is justified by clear user value and disclosed appropriately.
