Authors:
(1) An Yan, UC San Diego, ayan@ucsd.edu;
(2) Zhengyuan Yang, Microsoft Corporation, zhengyang@microsoft.com with equal contributions;
(3) Wanrong Zhu, UC Santa Barbara, wanrongzhu@ucsb.edu;
(4) Kevin Lin, Microsoft Corporation, keli@microsoft.com;
(5) Linjie Li, Microsoft Corporation, lindsey.li@microsoft.com;
(6) Jianfeng Wang, Microsoft Corporation, jianfw@microsoft.com;
(7) Jianwei Yang, Microsoft Corporation, jianwei.yang@microsoft.com;
(8) Yiwu Zhong, University of Wisconsin-Madison, yzhong52@wisc.edu;
(9) Julian McAuley, UC San Diego, jmcauley@ucsd.edu;
(10) Jianfeng Gao, Microsoft Corporation, jfgao@microsoft.com;
(11) Zicheng Liu, Microsoft Corporation, zliu@microsoft.com;
(12) Lijuan Wang, Microsoft Corporation, lijuanw@microsoft.com.
Editor’s note: This is part 8 of 13 of a paper evaluating the use of generative AI to navigate smartphones. You can read the rest of the paper via the table of links below.
Table of Links
- Abstract and 1 Introduction
- 2 Related Work
- 3 MM-Navigator
- 3.1 Problem Formulation and 3.2 Screen Grounding and Navigation via Set of Mark
- 3.3 History Generation via Multimodal Self Summarization
- 4 iOS Screen Navigation Experiment
- 4.1 Experimental Setup
- 4.2 Intended Action Description
- 4.3 Localized Action Execution and 4.4 The Current State with GPT-4V
- 5 Android Screen Navigation Experiment
- 5.1 Experimental Setup
- 5.2 Performance Comparison
- 5.3 Ablation Studies
- 5.4 Error Analysis
- 6 Discussion
- 7 Conclusion and References
5 Android Screen Navigation Experiment
5.1 Experimental Setup
Dataset. We use the AITW dataset (Rawles et al., 2023) for our evaluation on Android screen navigation. AITW is a large-scale benchmark dataset for UI control, which contains natural language instructions, screenshots from different Android systems at different resolutions, and user-annotated actions. It covers diverse multi-step tasks such as various web and application operations, app installation, and tasks with Google apps, with 715K episodes and 30K unique instructions in total. Table 2 shows the basic statistics of the dataset. We follow the split from previous work (Zhan and Zhang, 2023). Following the previous experiment setting (Rawles et al., 2023), which evaluates PaLM 2 on 288 randomly sampled episodes, we sample 300 episodes from the test split as our test set.
Metrics. Following previous work (Rawles et al., 2023; Zhan and Zhang, 2023), we compute the screen-wise partial action matching score as the main evaluation metric. It is defined as the number of correct actions divided by the episode length, and the score is then averaged over all tested episodes. A predicted action from GPT-4V is considered correct if both the action type and the gesture match the gold ones, i.e., the user actions. A click action is considered correct if the selected element falls within a 14% screen distance of the gold gesture or lies within the same detected bounding box as the user gesture. A scroll action is considered correct if it has the same scroll direction (up, down, left, or right) as the user gesture. The partial score has been shown to correlate with the task completion score estimated by human evaluation (Rawles et al., 2023), so it serves as a measure of the action success rate on this task.
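The matching rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the evaluation code used in the paper or in AITW. The `Action` class, the function names, the use of normalized [0, 1] coordinates, and the Euclidean reading of "14% screen distance" are all assumptions made for the example. Treating action types other than click and scroll as correct on a type match alone is also an assumption.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]               # (x, y), normalized to [0, 1] (assumed)
Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max), normalized


@dataclass
class Action:
    """Hypothetical container for one predicted or gold action."""
    kind: str                             # e.g. "click", "scroll", "type"
    point: Optional[Point] = None         # used by click actions
    direction: Optional[str] = None       # used by scroll actions


def _in_box(p: Point, box: Box) -> bool:
    x0, y0, x1, y1 = box
    return x0 <= p[0] <= x1 and y0 <= p[1] <= y1


def action_matches(pred: Action, gold: Action,
                   boxes: Sequence[Box] = (), max_dist: float = 0.14) -> bool:
    """Return True if the predicted action matches the gold (user) action."""
    if pred.kind != gold.kind:
        return False
    if gold.kind == "click":
        if pred.point is None or gold.point is None:
            return False
        dist = hypot(pred.point[0] - gold.point[0], pred.point[1] - gold.point[1])
        if dist <= max_dist:
            return True
        # Otherwise accept if both points fall inside the same detected box.
        return any(_in_box(pred.point, b) and _in_box(gold.point, b) for b in boxes)
    if gold.kind == "scroll":
        return pred.direction == gold.direction
    # Assumption: for other action types, matching the type is enough.
    return True


def episode_score(preds: Sequence[Action], golds: Sequence[Action],
                  boxes_per_step: Sequence[Sequence[Box]]) -> float:
    """Number of correct actions divided by the episode length."""
    if not golds:
        raise ValueError("episode must contain at least one gold action")
    correct = sum(action_matches(p, g, b)
                  for p, g, b in zip(preds, golds, boxes_per_step))
    return correct / len(golds)


def mean_partial_score(episodes) -> float:
    """Average the per-episode scores over all evaluated episodes."""
    return sum(episode_score(*ep) for ep in episodes) / len(episodes)
```

Each episode here is a `(preds, golds, boxes_per_step)` tuple with one entry per screen, so an episode in which one of two actions is correct contributes 0.5 to the average.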
Baselines. We compare with the following baselines (Rawles et al., 2023; Zhan and Zhang, 2023):
• PaLM-2 ZS (Rawles et al., 2023): Zero-shot performance with PaLM-2 (Anil et al., 2023), obtained by feeding the model a textual description of the screen and asking it to predict an action among the supported actions in AITW. We adopt a previously proposed LLM-based design for device control (Wang et al., 2023), where the input screen description is converted to HTML syntax.
• PaLM-2 5-shot (Rawles et al., 2023): Five examples of navigation are designed as chain-of-thought prompts. The history of prior actions taken by the agent is also fed into the model input.
• ChatGPT 5-shot (Zhan and Zhang, 2023): The input prompts are of the same format as PaLM-2 5-shot. Experiments are conducted via the ChatGPT API.
• Fine-tuned Llama-2 (Zhan and Zhang, 2023): The Llama-2 model (Touvron et al., 2023) is fine-tuned with LoRA (Hu et al., 2021). The model is fed the user instruction and screen descriptions in HTML syntax (the same ones used for the in-context learning LLMs) and trained to predict user actions. It is fine-tuned on 1% of the training data, randomly sampled, to help it adapt to this task.
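The text-only baselines above receive the screen as a description in HTML syntax. The exact format is not given in this section, so the following is a minimal sketch of what such a conversion might look like. The element dictionary layout (`kind`, `label`), the tag choices, and the function name `screen_to_html` are hypothetical and are not taken from the cited works.

```python
from html import escape
from typing import Dict, List


def screen_to_html(elements: List[Dict[str, str]]) -> str:
    """Render detected UI elements as one HTML-like line each.

    Hypothetical format: text elements become <p> tags and icons become
    <img> tags; the id is the element's index on the screen.
    """
    lines = []
    for idx, el in enumerate(elements):
        label = escape(el.get("label", ""))   # escape &, <, > and quotes
        if el.get("kind") == "icon":
            lines.append(f'<img id={idx} class="icon" alt="{label}">')
        else:
            lines.append(f'<p id={idx} class="text">{label}</p>')
    return "\n".join(lines)
```

Under this assumed format, a language model could refer to a target element by its `id` when predicting a click, which is one plausible way a text-only model can select on-screen elements.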
This paper is available on arXiv under a CC BY 4.0 DEED license.