Meet the AI That Can Actually Use Your Smartphone for You

11 Dec 2024

Authors:

(1) An Yan, UC San Diego, ayan@ucsd.edu;

(2) Zhengyuan Yang, Microsoft Corporation, zhengyang@microsoft.com (equal contribution);

(3) Wanrong Zhu, UC Santa Barbara, wanrongzhu@ucsb.edu;

(4) Kevin Lin, Microsoft Corporation, keli@microsoft.com;

(5) Linjie Li, Microsoft Corporation, lindsey.li@microsoft.com;

(6) Jianfeng Wang, Microsoft Corporation, jianfw@microsoft.com;

(7) Jianwei Yang, Microsoft Corporation, jianwei.yang@microsoft.com;

(8) Yiwu Zhong, University of Wisconsin-Madison, yzhong52@wisc.edu;

(9) Julian McAuley, UC San Diego, jmcauley@ucsd.edu;

(10) Jianfeng Gao, Microsoft Corporation, jfgao@microsoft.com;

(11) Zicheng Liu, Microsoft Corporation, zliu@microsoft.com;

(12) Lijuan Wang, Microsoft Corporation, lijuanw@microsoft.com.

Editor’s note: This is part 2 of 13 of a paper evaluating the use of generative AI to navigate smartphones. You can read the rest of the paper via the table of links below.

Autonomous GUI navigation. Autonomous GUI navigation requires a model to follow instructions and maneuver through graphical user interfaces, such as websites or applications, to complete the user-queried task. Current benchmarks collect either synthetic or real-world user-generated instructions to evaluate models’ abilities to identify specific UI elements (Shi et al., 2017; Li et al., 2020; Bai et al., 2021) or to achieve overarching task objectives by interacting with a series of GUI views (Li et al., 2020; Burns et al., 2021; Venkatesh et al., 2022; Deng et al., 2023; Rawles et al., 2023). To understand the visual information in these GUI views, one line of work adopts model architectures that can process multimodal inputs (Sun et al., 2022; Redmon et al., 2016). Other methods convert the UI scene text and icons into a text-only HTML-like format, so that single-module LLMs can process these text inputs for GUI navigation (Zhang et al., 2021; Rawles et al., 2023; Wen et al., 2023), as sketched below.
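To make the HTML-conversion idea concrete, here is a minimal sketch of how detected UI elements might be serialized into an HTML-like prompt that a text-only LLM can reason over. The `UIElement` fields and the `ui_to_html` helper are illustrative assumptions, not the exact format used by any of the cited works.

```python
# Minimal sketch: serialize detected UI elements into pseudo-HTML text
# so a text-only LLM can pick an element to act on. The fields and format
# below are hypothetical, not the representation of any specific paper.
from dataclasses import dataclass

@dataclass
class UIElement:
    element_id: int
    role: str      # e.g. "button", "input", "icon"
    text: str      # visible label or OCR'd text
    bounds: tuple  # (left, top, right, bottom) in pixels

def ui_to_html(elements: list[UIElement]) -> str:
    """Serialize a GUI view into an HTML-like string for an LLM prompt."""
    lines = []
    for el in elements:
        bounds = ",".join(map(str, el.bounds))
        lines.append(
            f'<{el.role} id="{el.element_id}" bounds="{bounds}">{el.text}</{el.role}>'
        )
    return "\n".join(lines)

# Example usage: the serialized view is appended to the task instruction,
# and the LLM is asked to return the id of the element to interact with.
view = [
    UIElement(0, "input", "Search or type URL", (24, 80, 696, 140)),
    UIElement(1, "button", "Sign in", (560, 160, 700, 220)),
]
prompt = "Task: sign in to the website.\nScreen:\n" + ui_to_html(view)
print(prompt)
```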

Multimodal agents. Recent advancements in LLMs (Brown et al., 2020; OpenAI, 2023a; Chowdhery et al., 2022; Anil et al., 2023; Touvron et al., 2023; Hoffmann et al., 2022) have catalyzed the exploration of LLM-based agent systems (Madaan et al., 2023; Shinn et al., 2023; Pan et al., 2023; Yao et al., 2022; Schick et al., 2023; Paranjape et al., 2023; Pryzant et al., 2023; Guo et al., 2023; Zhao et al., 2023; Yang et al., 2023a), which integrate reasoning logic and external tools for a variety of complex language tasks. Inspired by this success in the NLP domain, multimodal researchers have turned to multimodal agents. This line of research begins with LLM-based multimodal agents (Gupta and Kembhavi, 2023; Surís et al., 2023; Wu et al., 2023; Yang* et al., 2023; Shen et al., 2023; Lu et al., 2023; Yu et al., 2023; Li et al., 2023), such as MM-ReAct (Yang* et al., 2023) for advanced visual reasoning and Visual ChatGPT (Wu et al., 2023) for iterative visual generation and editing. Propelled by the rapid advancements of LMMs (Alayrac et al., 2022; Driess et al., 2023; OpenAI, 2023a,b,c; gpt, 2023; Yang et al., 2023c; Google, 2023), the latest studies have begun to investigate LMM-powered multimodal agents (Yang et al., 2023; Liu et al., 2023), thereby surpassing the need for basic visual description tools such as caption models (Wang et al., 2022a; Wu et al., 2022). Our proposed methodology is a specialized LMM-based agent for GUI navigation, and we aim to provide a comprehensive analysis and a strong baseline for this task.
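At a high level, an LMM-powered GUI agent can be viewed as a perceive-and-act loop: at each step it sends the current screenshot, the instruction, and the action history to the model, then executes the action the model returns. The sketch below is a hedged illustration of that loop, assuming hypothetical `get_screenshot`, `query_lmm`, and `execute_action` callables and a JSON-style action format; it is not the interface of any specific system discussed above.

```python
# Illustrative perceive-and-act loop for an LMM-driven GUI agent.
# `get_screenshot`, `query_lmm`, `execute_action`, and the action format
# are placeholders assumed for this sketch, not a published API.
def run_gui_agent(instruction, get_screenshot, query_lmm, execute_action, max_steps=10):
    history = []
    for _ in range(max_steps):
        screenshot = get_screenshot()  # current GUI view as an image
        action = query_lmm(            # ask the LMM for the next action as a dict
            image=screenshot,
            prompt=(
                f"Instruction: {instruction}\n"
                f"Previous actions: {history}\n"
                "Return the next action as JSON, e.g. "
                '{"type": "tap", "target": "<element>"} or {"type": "stop"}.'
            ),
        )
        if action.get("type") == "stop":  # the model decides the task is complete
            break
        execute_action(action)            # tap / type / scroll on the device
        history.append(action)
    return history
```

The loop stops either when the model emits a stop action or when a step budget is exhausted, which is one common way such agents bound their interaction length.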

This paper is available on arxiv under CC BY 4.0 DEED license.