Researchers Successfully Develop AI Model That Can Handle Everyday Tasks on Your iPhone

11 Dec 2024

Authors:

(1) An Yan, UC San Diego, ayan@ucsd.edu;

(2) Zhengyuan Yang, Microsoft Corporation, zhengyang@microsoft.com (equal contribution);

(3) Wanrong Zhu, UC Santa Barbara, wanrongzhu@ucsb.edu;

(4) Kevin Lin, Microsoft Corporation, keli@microsoft.com;

(5) Linjie Li, Microsoft Corporation, lindsey.li@microsoft.com;

(6) Jianfeng Wang, Microsoft Corporation, jianfw@microsoft.com;

(7) Jianwei Yang, Microsoft Corporation, jianwei.yang@microsoft.com;

(8) Yiwu Zhong, University of Wisconsin-Madison, yzhong52@wisc.edu;

(9) Julian McAuley, UC San Diego, jmcauley@ucsd.edu;

(10) Jianfeng Gao, Microsoft Corporation, jfgao@microsoft.com;

(11) Zicheng Liu, Microsoft Corporation, zliu@microsoft.com;

(12) Lijuan Wang, Microsoft Corporation, lijuanw@microsoft.com.

Editor’s note: This is part 6 of 13 of a paper evaluating the use of generative AI to navigate smartphones. You can read the rest of the paper via the table of links below.

4.2 Intended Action Description

Table 1 reports an accuracy of 90.9% on generating the correct intended action description, quantitatively supporting GPT-4V’s capability in understanding the screen actions to perform (Yang et al., 2023c; Lin et al., 2023). Figure 1 showcases representative screen understanding examples. Given a screen and a text instruction, GPT-4V produces a text description of its intended next move. For example, in Figure 1(a), GPT-4V recognizes Safari’s “limit of 500 open tabs” and suggests, “Try closing a few tabs and then see if the "+" button becomes clickable.” In (b), it describes the procedure for an iOS update: “You should click on "General" and then look for an option labeled "Software Update".” GPT-4V also effectively understands complicated screens with multiple images and icons. For example, in (c), GPT-4V notes, “For information on road closures and other alerts at Mt. Rainier, you should click on "6 Alerts" at the top of the screen.” Figure 1(d) gives an online shopping example, where GPT-4V suggests the correct product to check based on the user’s request for “wet cat food.”
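The setup above, in which a screenshot and a text instruction are sent to a vision-language model that replies with an intended-action description, can be sketched as a message payload in the chat format commonly accepted by GPT-4V-style APIs. This is a minimal illustration only: the prompt wording, function name, and message layout are assumptions, not the paper's actual prompt.

```python
import base64


def build_screen_query(image_path: str, instruction: str) -> list:
    """Pair a screenshot with a user instruction in a chat-style payload.

    The model is asked for a text description of its intended next move,
    mirroring the intended-action-description setting described above.
    """
    # Encode the screenshot as a base64 data URL, the usual way image
    # input is attached in vision-enabled chat APIs.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        f"Instruction: {instruction}\n"
                        "Describe the intended next action on this screen."
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{b64}"},
                },
            ],
        }
    ]
```

The returned list would then be passed to a vision-language model endpoint; the model's free-text reply plays the role of the intended action description that Table 1 scores for accuracy.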

Figure 2: Localized action execution examples. Best viewed by zooming in on the screen.

Figure 3: Representative failure cases in iOS screen navigation. Best viewed by zooming in on the screen.

This paper is available on arxiv under CC BY 4.0 DEED license.