Researchers Find Clever Way to Get AI to Navigate Your Screen

11 Dec 2024

Authors:

(1) An Yan, UC San Diego, ayan@ucsd.edu;

(2) Zhengyuan Yang, Microsoft Corporation, zhengyang@microsoft.com with equal contributions;

(3) Wanrong Zhu, UC Santa Barbara, wanrongzhu@ucsb.edu;

(4) Kevin Lin, Microsoft Corporation, keli@microsoft.com;

(5) Linjie Li, Microsoft Corporation, lindsey.li@microsoft.com;

(6) Jianfeng Wang, Microsoft Corporation, jianfw@microsoft.com;

(7) Jianwei Yang, Microsoft Corporation, jianwei.yang@microsoft.com;

(8) Yiwu Zhong, University of Wisconsin-Madison, yzhong52@wisc.edu;

(9) Julian McAuley, UC San Diego, jmcauley@ucsd.edu;

(10) Jianfeng Gao, Microsoft Corporation, jfgao@microsoft.com;

(11) Zicheng Liu, Microsoft Corporation, zliu@microsoft.com;

(12) Lijuan Wang, Microsoft Corporation, lijuanw@microsoft.com.

Editor’s note: This is part 3 of 13 of a paper evaluating the use of generative AI to navigate smartphones. You can read the rest of the paper via the table of links below.

3 MM-Navigator

3.1 Problem Formulation

Given a user instruction $X_{instr}$ in natural language, the agent must carry out a series of actions on the smartphone to complete it. The entire process of agent-environment interaction, from the initial state to the final state, is called an episode. At each time step $t$ of an episode, the agent is given a screenshot $I^t$ and decides the next action to take in order to complete the task.
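The episode structure described above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: `run_episode`, `get_screenshot`, `agent`, `apply_action`, the `"done"` stop signal, and the `max_steps` cap are all hypothetical names and conventions introduced here, and screenshots and actions are represented as plain strings for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a real system would use image data and structured actions.
Screenshot = str
Action = str


@dataclass
class Episode:
    """One run of agent-environment interaction for a single instruction."""
    instruction: str
    steps: List[Tuple[Screenshot, Action]] = field(default_factory=list)


def run_episode(
    instruction: str,
    get_screenshot: Callable[[int], Screenshot],
    agent: Callable[[str, Screenshot], Action],
    apply_action: Callable[[Action], None],
    max_steps: int = 10,
) -> Episode:
    """At each time step t, show the agent the current screen and apply its action."""
    episode = Episode(instruction)
    for t in range(max_steps):
        screenshot = get_screenshot(t)           # the screenshot at time step t
        action = agent(instruction, screenshot)  # the agent decides the next action
        episode.steps.append((screenshot, action))
        if action == "done":                     # assumed stop signal, not from the paper
            break
        apply_action(action)                     # the environment moves to the next state
    return episode
```

The step cap is only a safeguard for this sketch so that a loop with a misbehaving agent still terminates.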

3.2 Screen Grounding and Navigation via Set of Mark

GPT-4V is a multimodal model that takes images and text as input and produces text as output. One challenge is how to communicate with GPT-4V so that it can perform actions on the screen. A possible solution is to ask the model to reason about the coordinates to click on a given screen. However, in our preliminary exploration, although GPT-4V understands the screen well and knows approximately where to click to carry out an instruction (it can describe the corresponding icon or text), it appears to be poor at estimating accurate numerical coordinates.

Therefore, in this paper we seek a new approach: communicating with GPT-4V via Set-of-Mark prompting (Yang et al., 2023b) on the screen. Specifically, given a screen, we detect UI elements using an OCR tool and IconNet (Sunkara et al., 2022). Each element has a bounding box and contains either OCR-detected text or an icon class label (one of the 96 possible icon types detected by Sunkara et al., 2022). At each time step $t$, we add numeric tags to those elements and present GPT-4V with the original screen $I^t$ and the tagged screen $I^t_{tags}$. The output text $Y_{action}$ of GPT-4V is conditioned on the two images. If GPT-4V decides to click somewhere on the screen, it chooses from the available numeric tags. In practice, we found that this simple method works well, setting up a strong baseline for screen navigation with large multimodal models.
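The tagging idea can be illustrated with a small sketch that gives each detected element a numeric tag and resolves a chosen tag back to a screen position. It rests on assumptions the text does not state: the names `UIElement`, `assign_tags`, and `tag_to_click_point` are hypothetical, tags are numbered from 1 in detection order, and a click is placed at the center of the element's bounding box. Detection, drawing the tags on the image, and the call to GPT-4V are omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class UIElement:
    """A detected UI element: a bounding box plus OCR text or an icon class label."""
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
    label: str


def assign_tags(elements: List[UIElement]) -> Dict[int, UIElement]:
    """Give each element a numeric tag (numbering from 1 is an assumption)."""
    return {tag: element for tag, element in enumerate(elements, start=1)}


def tag_to_click_point(tags: Dict[int, UIElement], chosen_tag: int) -> Tuple[int, int]:
    """Map the model's chosen tag to pixel coordinates (box center, an assumption)."""
    left, top, right, bottom = tags[chosen_tag].box
    return ((left + right) // 2, (top + bottom) // 2)


# Example with made-up elements: one OCR text region and one icon.
elements = [
    UIElement(box=(0, 0, 100, 40), label="Settings"),
    UIElement(box=(10, 50, 30, 70), label="ICON_SEARCH"),
]
tags = assign_tags(elements)
click_point = tag_to_click_point(tags, 2)  # center of the second element's box
```

Under this design the model only has to name a tag, so the numerical coordinates come from the detector's bounding boxes rather than from the model's own estimate.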

This paper is available on arXiv under the CC BY 4.0 DEED license.