Introduction to Google’s Research on User Intent
Google has published a research paper on a new method for extracting user intent from interactions on mobile devices and browsers. The goal is to enable autonomous agents to understand what a user is trying to do, without compromising their privacy. The method uses small models that run on the device, eliminating the need to send data back to Google.
How the Method Works
The researchers found that splitting the problem into two tasks allows small models to outperform large language models on this job. The first stage summarizes the user's actions on the device; the second identifies the user's intent from those summaries. Because both stages can run on the device, the user's data stays private.
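The paper does not publish code, so the following is only a minimal sketch of that pipeline in Python; the function names, signatures, and placeholder bodies are illustrative assumptions, not Google's implementation.

```python
from typing import List, Tuple

def summarize_step(screen_description: str, action: str) -> str:
    """Stage 1: a small prompted on-device model summarizes one interaction."""
    # Placeholder: a real implementation would prompt a small
    # vision-language model with the screenshot and the action.
    return f"Screen: {screen_description}. Action: {action}."

def infer_intent(summaries: List[str]) -> str:
    """Stage 2: a small fine-tuned model maps the summaries to an intent."""
    # Placeholder: a real implementation would run a fine-tuned
    # on-device model over the concatenated summaries.
    return "Intent inferred from: " + " -> ".join(summaries)

def extract_intent(trajectory: List[Tuple[str, str]]) -> str:
    # Stage 1: summarize each (observation, action) pair on the device.
    summaries = [summarize_step(obs, act) for obs, act in trajectory]
    # Stage 2: infer the overall intent from the sequence of summaries.
    # Raw screenshots never need to leave the device; only text
    # summaries flow between the two stages.
    return infer_intent(summaries)
```

The key design point is the interface between the stages: only short text summaries cross it, which is what keeps the raw interaction data on the device.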
Smaller Models on Browsers and Devices
The focus of the research is on identifying user intent from a series of actions on a mobile device or in a browser. A model on the device summarizes each thing the user does, and the sequence of summaries is then passed to a second model that identifies the user's intent. The researchers report that this approach outperforms both smaller models and large language models, regardless of dataset or model type.
Intent Extraction from UI Interactions
The researchers build on a technique proposed in 2025 for extracting intent from screenshots and text descriptions of user interactions. They improve on it with a two-stage method: the first stage summarizes the user's actions, and the second stage identifies the user's intent from those summaries. The user journey, or trajectory, is represented as a sequence of interactions, each pairing an observation (a screenshot) with an action (what the user did).
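The paper does not specify a concrete schema, but one natural way to represent such a trajectory in code is as an ordered list of observation/action pairs; this sketch assumes that representation, with a text caption standing in for the screenshot.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Interaction:
    """One step in a user trajectory: what was seen and what was done."""
    observation: str  # the screenshot, stood in for here by a caption
    action: str       # the user interaction, e.g. "tapped 'Add to cart'"

@dataclass
class Trajectory:
    """A user journey: the ordered interactions in a session."""
    interactions: List[Interaction]

# Example: three steps of a hypothetical shopping session.
journey = Trajectory(interactions=[
    Interaction("search results for 'trail running shoes'", "tapped third result"),
    Interaction("product page with price and reviews", "scrolled to reviews"),
    Interaction("product page", "tapped 'Add to cart'"),
])
```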
Challenges in Evaluating Extracted Intents
Evaluating extracted intents is challenging because user intents contain complex details, are inherently subjective, and are often ambiguous. For example, did a user choose a product because of its price or its features? The actions are visible, but the motivations are not.
Two-Stage Approach
The researchers chose a two-stage approach that emulates Chain of Thought reasoning. The first stage uses a prompted model to generate a summary of each interaction, and the second stage uses a fine-tuned model to generate the overall intent description.
The First Stage: Screenshot Summary
The first stage divides each summary into two parts: a description of what's on the screen and a description of the user's action. The researchers added a third component, a speculative intent, which gives the model an outlet for its guesses about the user's goal; that speculation is then discarded, keeping the factual parts of the summary clean. Surprisingly, allowing the model to speculate and then throwing the speculation away leads to a higher quality result.
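As a rough illustration of that speculate-then-discard idea: the paper's actual prompt wording is not published, so the template and helper below are assumptions.

```python
# Illustrative first-stage prompt; the paper's actual wording is not
# published, so this template is an assumption.
STEP_SUMMARY_PROMPT = """Given the screenshot and the user's action, produce:
1. Screen description: what is visible on the screen.
2. Action description: what the user did.
3. Speculative intent: your best guess at why (this will be discarded).
"""

def strip_speculation(summary: str) -> str:
    """Keep only the factual parts of a step summary.

    Eliciting the speculation gives the model an outlet for guessing,
    which keeps parts 1 and 2 factual; the speculation itself is then
    dropped before the summary is passed to the second stage.
    """
    kept = []
    for line in summary.splitlines():
        if line.strip().lower().startswith("3. speculative intent"):
            break
        kept.append(line)
    return "\n".join(kept)
```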
The Second Stage: Generating Overall Intent Description
The second stage involves fine-tuning a model to generate an overall intent description. The model is trained on pairs: the summaries of all interactions in a trajectory as input, and the matching ground truth describing that trajectory's overall intent as the target. To curb hallucination, the researchers refined the target intents, removing details that aren't reflected in the input summaries.
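A sketch of how such training pairs might be assembled, including the target-refinement step; the refine_target helper is a hypothetical stand-in for whatever process the researchers actually used to remove unsupported details.

```python
from typing import List, Tuple

def refine_target(intent: str, summaries: List[str]) -> str:
    """Hypothetical stand-in for the paper's target refinement: keep only
    the parts of the ground-truth intent that are reflected in the input
    summaries, so the trained model is not rewarded for hallucinating."""
    supported = " ".join(summaries).lower()
    kept = [word for word in intent.split() if word.lower().strip(".,") in supported]
    # A real pipeline would rewrite the target with a model rather than
    # naive word matching; this only illustrates the idea.
    return " ".join(kept)

def build_training_pair(summaries: List[str], ground_truth_intent: str) -> Tuple[str, str]:
    # Input: the per-step summaries for one trajectory, concatenated.
    model_input = "\n".join(summaries)
    # Target: the ground-truth intent, refined to drop unsupported details.
    model_target = refine_target(ground_truth_intent, summaries)
    return model_input, model_target
```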
Ethical Considerations and Limitations
The research paper ends by summarizing potential ethical issues, such as an autonomous agent taking actions that are not in the user's interest. The authors also acknowledge limitations: testing was done only in Android and web environments, so the results may not generalize to Apple devices, and the study was limited to English-language users in the United States.
Conclusion
The research paper demonstrates a new method for extracting user intent from interactions on mobile devices and browsers without compromising user privacy. The two-stage approach outperforms large language models, and the method could support applications such as proactive assistance and personalized memory. While the research is still in its early stages, it shows the direction Google is heading: small on-device models watching user interactions and occasionally stepping in to assist, based on the user's inferred intent.

