Authors:
(1) An Yan, UC San Diego, ayan@ucsd.edu;
(2) Zhengyuan Yang, Microsoft Corporation, zhengyang@microsoft.com with equal contributions;
(3) Wanrong Zhu, UC Santa Barbara, wanrongzhu@ucsb.edu;
(4) Kevin Lin, Microsoft Corporation, keli@microsoft.com;
(5) Linjie Li, Microsoft Corporation, lindsey.li@microsoft.com;
(6) Jianfeng Wang, Microsoft Corporation, jianfw@microsoft.com;
(7) Jianwei Yang, Microsoft Corporation, jianwei.yang@microsoft.com;
(8) Yiwu Zhong, University of Wisconsin-Madison, yzhong52@wisc.edu;
(9) Julian McAuley, UC San Diego, jmcauley@ucsd.edu;
(10) Jianfeng Gao, Microsoft Corporation, jfgao@microsoft.com;
(11) Zicheng Liu, Microsoft Corporation, zliu@microsoft.com;
(12) Lijuan Wang, Microsoft Corporation, lijuanw@microsoft.com.
Editor’s note: This is part 3 of 13 of a paper evaluating the use of a generative AI to navigate smartphones. You can read the rest of the paper via the table of links below.
Table of Links
- Abstract and 1 Introduction
- 2 Related Work
- 3 MM-Navigator
- 3.1 Problem Formulation and 3.2 Screen Grounding and Navigation via Set of Mark
- 3.3 History Generation via Multimodal Self Summarization
- 4 iOS Screen Navigation Experiment
- 4.1 Experimental Setup
- 4.2 Intended Action Description
- 4.3 Localized Action Execution and 4.4 The Current State with GPT-4V
- 5 Android Screen Navigation Experiment
- 5.1 Experimental Setup
- 5.2 Performance Comparison
- 5.3 Ablation Studies
- 5.4 Error Analysis
- 6 Discussion
- 7 Conclusion and References
3 MM-Navigator
3.1 Problem Formulation
Given a user instruction X_instr in natural language, the agent is asked to complete a sequence of actions on the smartphone to fulfill the instruction. The entire process of agent-environment interactions from the initial to the final state is called an episode. At each time step t of an episode, the agent is given a screenshot I_t and must decide the next action to take in order to complete the task.
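To make this formulation concrete, the following Python sketch outlines one episode loop. The `env` and `agent` interfaces, the `Action` record, and the step limit are illustrative assumptions for this sketch, not part of the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Action:
    """Hypothetical action record; the paper does not prescribe a concrete schema."""
    kind: str                      # e.g. "click", "scroll", or "stop"
    target: Optional[int] = None   # numeric tag of the chosen UI element, if any


def run_episode(x_instr: str, env, agent, max_steps: int = 20) -> List[Action]:
    """Run one agent-environment episode for the natural-language instruction x_instr."""
    history: List[Action] = []
    for t in range(max_steps):                     # one iteration per time step t
        screenshot = env.screenshot()              # I_t: current screen image
        action = agent.decide(x_instr, screenshot, history)
        history.append(action)
        if action.kind == "stop":                  # the agent judges the task complete
            break
        env.execute(action)                        # apply the action on the device
    return history
```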
3.2 Screen Grounding and Navigation via Set of Mark
GPT-4V is a multimodal model that takes images and text as inputs and produces text output. One challenge is how to communicate with GPT-4V so that it can perform actions on the screen. A possible solution is to ask the model to reason about the coordinates to click given a screen. However, in our preliminary exploration we found that, although GPT-4V has a good understanding of the screen and can describe approximately where to click to follow an instruction (e.g., by referring to the corresponding icon or text), it is poor at estimating accurate numerical coordinates.
Therefore, in this paper we take a different approach and communicate with GPT-4V via Set-of-Mark prompting (Yang et al., 2023b) on the screen. Specifically, given a screen, we detect UI elements with an OCR tool and IconNet (Sunkara et al., 2022). Each element comes with a bounding box and either OCR-detected text or an icon class label (one of the 96 icon types detected by Sunkara et al. (2022)). At each time step t, we add numeric tags to these elements and present GPT-4V with both the original screen I_t and the tagged screen I_t^tags. The output text Y_action of GPT-4V is conditioned on the two images. If GPT-4V decides to click somewhere on the screen, it chooses from the available numeric tags. In practice, we find that this simple method works well, establishing a strong baseline for screen navigation with large multimodal models.
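As a rough illustration of the tagging step, the sketch below overlays numeric tags on detected bounding boxes and assembles a text prompt that lists the elements by tag. The helper names, label strings, and prompt wording are assumptions made for illustration; the paper itself specifies only the ingredients (OCR + IconNet detections, and both I_t and I_t^tags supplied to GPT-4V as images).

```python
from typing import List, Tuple

from PIL import Image, ImageDraw


def add_numeric_tags(screen: Image.Image,
                     boxes: List[Tuple[int, int, int, int]]) -> Image.Image:
    """Overlay numeric tags (1, 2, ...) on the detected UI-element bounding boxes,
    producing the tagged screenshot I_t^tags."""
    tagged = screen.copy()
    draw = ImageDraw.Draw(tagged)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)
        draw.text((x0 + 2, y0 + 2), str(idx), fill="red")
    return tagged


def build_som_prompt(instruction: str, labels: List[str]) -> str:
    """Compose the text part of the request: the instruction plus the element
    descriptions (OCR text or icon class) keyed by their numeric tags. The two
    screenshots, I_t and I_t^tags, would be attached to the request as images."""
    element_list = "\n".join(f"{i}. {label}" for i, label in enumerate(labels, start=1))
    return (
        f"Instruction: {instruction}\n"
        f"UI elements on screen (by numeric tag):\n{element_list}\n"
        "If you decide to click, answer with the numeric tag of the target element."
    )
```

Because the model answers with a numeric tag rather than pixel coordinates, grounding reduces to choosing among the detected elements, which sidesteps the coordinate-estimation weakness noted above.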
This paper is available on arxiv under CC BY 4.0 DEED license.