Proactive Goal Creator
also known as Multimodal Goal Anticipator, Context-Capturing Goal Creator
Anticipate the user's goal by capturing surrounding multimodal context (gestures, screen state, environment) in addition to what the user types or says.
This pattern helps complete certain larger patterns —
- used-by Prompt/Response Optimiser ★★: At runtime, transform user inputs and model outputs into standardised, template-aligned prompts and responses against predefined constraints, so the agent and its downstream consumers see consistent shapes.
Context
A team builds an agent for a setting where the user cannot or will not articulate the full context in text — an accessibility tool used by someone with limited speech, an ambient home assistant, an embodied robot, a screen-aware coding helper. Cameras, microphones, screen capture, or other sensors are available and can supply context the user does not state. The team has the operational and privacy approvals to capture and process that data.
Problem
If the agent listens only to the user's typed or spoken prompt, it misses the gesture pointing at the object, the screen state the user is looking at, and the ambient activity the user assumes is obvious. The user is then forced either to over-articulate (typing what they are already pointing at) or to accept wrong answers. Naively piping raw sensor streams into the planner overwhelms downstream components with multimodal data they cannot use directly. The team needs a component that captures and synthesises the relevant non-verbal context into a structured goal before planning begins.
Forces
- Underspecification: users may be unable or unwilling to verbalise full context.
- Accessibility: users with motor or speech impairments cannot rely on dialogue alone.
- Overhead: multimodal capture adds cost (sensors, bandwidth, privacy review).
Example
A user points at an object on their desk and says "can you order another one of these". A proactive goal creator captures the camera frame, recognises the object, combines that with the spoken request, and emits a goal: "reorder the visible model of headphones for the user's default address". The user never had to type a SKU.
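As a rough illustration, the goal emitted in this example could be carried as a small structured record rather than a free-text string. The sketch below is only one possible shape: the class and field names (`Goal`, `Evidence`, `action`, `target`, `constraints`, `evidence`) are hypothetical and are not prescribed by the pattern.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    """One piece of context that contributed to the goal (hypothetical schema)."""
    modality: str  # e.g. "speech" or "camera"
    summary: str   # synthesised description, never the raw frame or audio


@dataclass(frozen=True)
class Goal:
    """Structured goal handed to the planner (hypothetical schema)."""
    action: str
    target: str
    constraints: dict = field(default_factory=dict)
    evidence: tuple = ()

    def describe(self) -> str:
        # Render the goal as one line: action and target first, then constraints.
        parts = [f"{self.action} {self.target}"]
        parts += [f"{key}: {value}" for key, value in self.constraints.items()]
        return "; ".join(parts)


goal = Goal(
    action="reorder",
    target="the visible model of headphones",
    constraints={"ship_to": "user's default address"},
    evidence=(
        Evidence("speech", 'user said "can you order another one of these"'),
        Evidence("camera", "user is pointing at headphones on the desk"),
    ),
)

print(goal.describe())
```

Keeping the evidence next to the goal lets a downstream component see which modality supplied which part of the request without having to touch the camera frame itself.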
Solution
Therefore:
A proactive goal creator runs alongside the dialogue interface. It activates context-capture devices (cameras for gestures, screen recorders for UI state, microphones for ambient audio, environment sensors), passes the multimodal data through context engineering, and combines the result with the user's articulated prompt to produce a refined goal. The component must notify users whenever context is being captured, and it must keep its false-positive rate low, so that capture never comes as a surprise.
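The flow described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a reference implementation: `ContextSignal`, `RefinedGoal`, `create_goal`, the `capture` and `notify_user` callbacks, and the `min_confidence` threshold are all hypothetical names, and "context engineering" is reduced here to a simple confidence filter over signals that some upstream recogniser has already summarised as text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ContextSignal:
    """Synthesised observation from one capture device (hypothetical type)."""
    source: str        # e.g. "camera", "screen", "microphone"
    description: str   # text summary produced by an upstream recogniser
    confidence: float  # 0.0 to 1.0, as reported by that recogniser


@dataclass(frozen=True)
class RefinedGoal:
    prompt: str     # what the user actually said or typed
    context: tuple  # descriptions of the signals that were kept
    text: str       # goal statement passed on to planning


def create_goal(
    prompt: str,
    capture: Callable[[], list],
    notify_user: Callable[[str], None],
    min_confidence: float = 0.6,  # assumed threshold, tune per deployment
) -> RefinedGoal:
    # Disclose capture before any device is read.
    notify_user("Capturing surrounding context to understand your request.")
    signals = capture()

    # Stand-in for context engineering: drop low-confidence observations.
    kept = [s for s in signals if s.confidence >= min_confidence]
    context = tuple(f"{s.source}: {s.description}" for s in kept)

    # Combine the articulated prompt with whatever context survived.
    if context:
        joined = "; ".join(context)
        text = f"{prompt} [context: {joined}]"
    else:
        text = prompt
    return RefinedGoal(prompt=prompt, context=context, text=text)


def fake_capture() -> list:
    # Canned signals standing in for real devices.
    return [
        ContextSignal("camera", "user points at headphones on desk", 0.91),
        ContextSignal("microphone", "background music", 0.30),
    ]


notices = []
goal = create_goal("can you order another one of these", fake_capture, notices.append)
print(goal.text)
```

In this sketch the low-confidence microphone signal is dropped, so only the camera observation reaches the goal text. A real deployment would likely replace the filter with something richer, but the shape (notify, capture, synthesise, combine) follows the description above.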
What this pattern forbids. Multimodal capture may not run without being disclosed to the user, and downstream planning may not consume raw sensor streams, only the synthesised goal.
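One way these two prohibitions might be enforced in code is to make both of them fail loudly. The sketch below is illustrative only: `CaptureSession`, `UndisclosedCaptureError`, `SynthesisedGoal`, and `plan` are hypothetical names, and the sensor is a stand-in callable rather than a real device API.

```python
from dataclasses import dataclass


class UndisclosedCaptureError(RuntimeError):
    """Raised when capture is attempted before the user has been told."""


@dataclass(frozen=True)
class SynthesisedGoal:
    text: str


class CaptureSession:
    """Hypothetical wrapper that refuses to read a sensor until disclosure."""

    def __init__(self, read_sensor):
        self._read_sensor = read_sensor
        self._disclosed = False

    def disclose(self, notify_user):
        # Tell the user first, then unlock reads.
        notify_user("Context capture is active.")
        self._disclosed = True

    def read(self):
        if not self._disclosed:
            raise UndisclosedCaptureError("capture must be disclosed first")
        return self._read_sensor()


def plan(goal):
    """Planner entry point: accepts only a synthesised goal, never raw data."""
    if not isinstance(goal, SynthesisedGoal):
        raise TypeError("planner accepts SynthesisedGoal only, not raw sensor data")
    return [f"step 1: resolve '{goal.text}'"]


session = CaptureSession(lambda: b"raw-frame-bytes")  # stand-in sensor
messages = []
session.disclose(messages.append)
frame = session.read()  # allowed only after disclose()
steps = plan(SynthesisedGoal("reorder the visible model of headphones"))
print(steps)
```

Passing `frame` directly to `plan` would raise `TypeError`, and calling `read()` on a fresh session would raise `UndisclosedCaptureError`, so a violation of either rule surfaces at the boundary where it happens.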
And the patterns that stand alongside it, or against it —
- alternative-to Passive Goal Creator ★: Analyse the user's articulated prompts and accompanying context to derive a precise, actionable goal before any planning or tool use begins.
- complements Input/Output Guardrails ★★: Validate inputs before they reach the model and outputs before they reach the user.
- complements Computer Use ★: Let the model drive a desktop end-to-end via screenshots plus virtual mouse/keyboard tool calls instead of bespoke per-app APIs.