Authors:
(1) An Yan, UC San Diego, ayan@ucsd.edu;
(2) Zhengyuan Yang, Microsoft Corporation, zhengyang@microsoft.com (equal contribution);
(3) Wanrong Zhu, UC Santa Barbara, wanrongzhu@ucsb.edu;
(4) Kevin Lin, Microsoft Corporation, keli@microsoft.com;
(5) Linjie Li, Microsoft Corporation, lindsey.li@microsoft.com;
(6) Jianfeng Wang, Microsoft Corporation, jianfw@microsoft.com;
(7) Jianwei Yang, Microsoft Corporation, jianwei.yang@microsoft.com;
(8) Yiwu Zhong, University of Wisconsin-Madison, yzhong52@wisc.edu;
(9) Julian McAuley, UC San Diego, jmcauley@ucsd.edu;
(10) Jianfeng Gao, Microsoft Corporation, jfgao@microsoft.com;
(11) Zicheng Liu, Microsoft Corporation, zliu@microsoft.com;
(12) Lijuan Wang, Microsoft Corporation, lijuanw@microsoft.com.
Editor’s note: This is part 10 of 13 of a paper evaluating the use of a generative AI to navigate smartphones. You can read the rest of the paper via the table of links below.
Table of Links
- Abstract and 1 Introduction
- 2 Related Work
- 3 MM-Navigator
- 3.1 Problem Formulation and 3.2 Screen Grounding and Navigation via Set of Mark
- 3.3 History Generation via Multimodal Self Summarization
- 4 iOS Screen Navigation Experiment
- 4.1 Experimental Setup
- 4.2 Intended Action Description
- 4.3 Localized Action Execution and 4.4 The Current State with GPT-4V
- 5 Android Screen Navigation Experiment
- 5.1 Experimental Setup
- 5.2 Performance Comparison
- 5.3 Ablation Studies
- 5.4 Error Analysis
- 6 Discussion
- 7 Conclusion and References
5.3 Ablation Studies
For the ablation studies, we randomly sampled 50 episodes in total across 5 categories, a different subset from the one used for the main results.
Different tagging methods. We first perform an ablation study to compare performance across different methods of adding tags to the screen, shown in Table 4. We consider three methods: (1) By side, which adds tags as black squares (the same style as Rawles et al. (2023)) to the left side of each detected icon; (2) Red, which uses a red circle for each tag; (3) Center, which adds tags as black squares at the center of each detected box. First, adding tags to the left side of boxes can cause problems, for example when icons are too close to each other, leading to slightly worse results. As for tagging styles, we did not find a significant difference between red circles and black rectangles, though empirically black rectangles (Yang et al., 2023b) perform slightly better.
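To make the three tagging variants concrete, below is a minimal sketch of how such overlays could be rendered with Pillow. The box format, tag size, colors, and text placement are illustrative assumptions, not the paper's actual rendering code.

```python
# Hypothetical sketch of the three tagging variants (side / center / red).
# Assumes icon boxes are (x1, y1, x2, y2) pixel coordinates; not the paper's code.
from PIL import Image, ImageDraw

TAG_SIZE = 24  # assumed tag edge length in pixels


def add_tags(screenshot: Image.Image,
             boxes: list[tuple[int, int, int, int]],
             style: str = "center") -> Image.Image:
    """Overlay numeric tags on detected icon boxes."""
    img = screenshot.copy()
    draw = ImageDraw.Draw(img)
    half = TAG_SIZE // 2
    for idx, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        if style == "side":      # black square to the left of the box
            cx, cy = max(half, x1 - half), (y1 + y2) // 2
            shape, fill = "square", "black"
        elif style == "center":  # black square at the box center
            cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
            shape, fill = "square", "black"
        elif style == "red":     # red circle at the box center
            cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
            shape, fill = "circle", "red"
        else:
            raise ValueError(f"unknown style: {style}")
        bbox = (cx - half, cy - half, cx + half, cy + half)
        if shape == "square":
            draw.rectangle(bbox, fill=fill)
        else:
            draw.ellipse(bbox, fill=fill)
        # Approximate placement of the index label inside the tag.
        draw.text((cx - half + 6, cy - half + 4), str(idx), fill="white")
    return img
```

The "side" placement in this sketch makes the failure mode visible: when two boxes share a left edge region, their tags collide, which is consistent with the slightly worse results reported for that variant.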
Different prompts. We then perform a robustness check with different prompting variants: (1) Baseline: simply ask GPT-4V to take actions; (2) Think: prompt GPT-4V to think step by step (Kojima et al., 2022); (3) Detail: provide more context for the task. Overall, we did not observe improvements from "thinking step by step," but adding more task description helps GPT-4V better execute actions.
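For concreteness, below is a hedged sketch of what the three prompt variants might look like as templates. The wording is our assumption for illustration; the paper's exact prompts are not reproduced here.

```python
# Illustrative prompt templates for the three variants; wording is assumed.

BASELINE = (
    "You are given a phone screenshot where interactable UI elements are "
    "marked with numeric tags. Goal: {goal}. Output the next action."
)

# "Think" variant: append the zero-shot chain-of-thought trigger
# from Kojima et al. (2022).
THINK = BASELINE + " Let's think step by step."

# "Detail" variant: add more task context and an explicit action space.
DETAIL = (
    "You are controlling an Android phone to complete a user instruction. "
    "The screenshot shows the current screen; each interactable element is "
    "marked with a numeric tag. Valid actions: tap a tagged element, type "
    "text, scroll, or press a system button. Goal: {goal}. "
    "Output exactly one action."
)

prompt = DETAIL.format(goal="Open the Settings app and enable Wi-Fi")
```

Under this framing, the reported result reads naturally: the extra task description in the Detail variant constrains the action space, while the chain-of-thought trigger alone adds no such constraint.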
This paper is available on arxiv under CC BY 4.0 DEED license.