Meet the AI That Can Actually Use Your Smartphone for You

11 Dec 2024

Authors:

(1) An Yan, UC San Diego, ayan@ucsd.edu;

(2) Zhengyuan Yang, Microsoft Corporation, zhengyang@microsoft.com (equal contribution);

(3) Wanrong Zhu, UC Santa Barbara, wanrongzhu@ucsb.edu;

(4) Kevin Lin, Microsoft Corporation, keli@microsoft.com;

(5) Linjie Li, Microsoft Corporation, lindsey.li@microsoft.com;

(6) Jianfeng Wang, Microsoft Corporation, jianfw@microsoft.com;

(7) Jianwei Yang, Microsoft Corporation, jianwei.yang@microsoft.com;

(8) Yiwu Zhong, University of Wisconsin-Madison, yzhong52@wisc.edu;

(9) Julian McAuley, UC San Diego, jmcauley@ucsd.edu;

(10) Jianfeng Gao, Microsoft Corporation, jfgao@microsoft.com;

(11) Zicheng Liu, Microsoft Corporation, zliu@microsoft.com;

(12) Lijuan Wang, Microsoft Corporation, lijuanw@microsoft.com.

Editor’s note: This is part 2 of 13 of a paper evaluating the use of generative AI to navigate smartphones. You can read the rest of the paper via the table of links below.

Autonomous GUI navigation. Autonomous GUI navigation requires a model to follow instructions and maneuver through graphical user interfaces, such as websites or applications, to complete the user-queried task. Current benchmarks collect either synthetic or real-world user-generated instructions to evaluate models’ abilities to identify specific UI elements (Shi et al., 2017; Li et al., 2020; Bai et al., 2021) or to achieve overarching task objectives by interacting with a series of GUI views (Li et al., 2020; Burns et al., 2021; Venkatesh et al., 2022; Deng et al., 2023; Rawles et al., 2023). To understand the visual information in these GUI views, one line of work adopts model architectures that can process multimodal inputs (Sun et al., 2022; Redmon et al., 2016). Other methods convert the UI scene text and icons into a text-only HTML-like format, so that single-module LLMs can process these text inputs for GUI navigation (Zhang et al., 2021; Rawles et al., 2023; Wen et al., 2023), as sketched below.
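To make the HTML-conversion idea concrete, here is a minimal sketch of how detected UI elements might be serialized into an HTML-like prompt that a text-only LLM can reason over. The `UIElement` fields and the `ui_to_html` helper are illustrative assumptions, not the exact format used by any of the cited works.

```python
# Minimal sketch: serialize detected UI elements into pseudo-HTML text
# so a text-only LLM can pick an element to act on. The fields and format
# below are hypothetical, not the representation of any specific paper.
from dataclasses import dataclass

@dataclass
class UIElement:
    element_id: int
    role: str      # e.g. "button", "input", "icon"
    text: str      # visible label or OCR'd text
    bounds: tuple  # (left, top, right, bottom) in pixels

def ui_to_html(elements: list[UIElement]) -> str:
    """Serialize a GUI view into an HTML-like string for an LLM prompt."""
    lines = []
    for el in elements:
        bounds = ",".join(map(str, el.bounds))
        lines.append(
            f'<{el.role} id="{el.element_id}" bounds="{bounds}">{el.text}</{el.role}>'
        )
    return "\n".join(lines)

# Example usage: the serialized view is appended to the task instruction,
# and the LLM is asked to return the id of the element to interact with.
view = [
    UIElement(0, "input", "Search or type URL", (24, 80, 696, 140)),
    UIElement(1, "button", "Sign in", (560, 160, 700, 220)),
]
prompt = "Task: sign in to the website.\nScreen:\n" + ui_to_html(view)
print(prompt)
```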

Multimodal agents. Recent advancements in LLMs (Brown et al., 2020; OpenAI, 2023a; Chowdhery et al., 2022; Anil et al., 2023; Touvron et al., 2023; Hoffmann et al., 2022) have catalyzed the exploration of LLM-based agent systems (Madaan et al., 2023; Shinn et al., 2023; Pan et al., 2023; Yao et al., 2022; Schick et al., 2023; Paranjape et al., 2023; Pryzant et al., 2023; Guo et al., 2023; Zhao et al., 2023; Yang et al., 2023a), which integrate reasoning logic and external tools for a variety of complex language tasks. Inspired by this success in the NLP domain, multimodal researchers have turned to multimodal agents. This line of research begins with LLM-based multimodal agents (Gupta and Kembhavi, 2023; Surís et al., 2023; Wu et al., 2023; Yang* et al., 2023; Shen et al., 2023; Lu et al., 2023; Yu et al., 2023; Li et al., 2023), such as MM-ReAct (Yang* et al., 2023) for advanced visual reasoning and Visual ChatGPT (Wu et al., 2023) for iterative visual generation and editing. Propelled by the rapid advancements of LMMs (Alayrac et al., 2022; Driess et al., 2023; OpenAI, 2023a,b,c; gpt, 2023; Yang et al., 2023c; Google, 2023), the latest studies have begun to investigate LMM-powered multimodal agents (Yang et al., 2023; Liu et al., 2023), thereby surpassing the need for basic visual description tools such as caption models (Wang et al., 2022a; Wu et al., 2022). Our proposed methodology is a specialized LMM-based agent for GUI navigation, and we aim to provide a comprehensive analysis and a strong baseline for this task.
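At a high level, an LMM-powered GUI agent can be viewed as a perceive-and-act loop: at each step it sends the current screenshot, the instruction, and the action history to the model, then executes the action the model returns. The sketch below is a hedged illustration of that loop, assuming hypothetical `get_screenshot`, `query_lmm`, and `execute_action` callables and a JSON-style action format; it is not the interface of any specific system discussed above.

```python
# Illustrative perceive-and-act loop for an LMM-driven GUI agent.
# `get_screenshot`, `query_lmm`, `execute_action`, and the action format
# are placeholders assumed for this sketch, not a published API.
def run_gui_agent(instruction, get_screenshot, query_lmm, execute_action, max_steps=10):
    history = []
    for _ in range(max_steps):
        screenshot = get_screenshot()  # current GUI view as an image
        action = query_lmm(            # ask the LMM for the next action as a dict
            image=screenshot,
            prompt=(
                f"Instruction: {instruction}\n"
                f"Previous actions: {history}\n"
                "Return the next action as JSON, e.g. "
                '{"type": "tap", "target": "<element>"} or {"type": "stop"}.'
            ),
        )
        if action.get("type") == "stop":  # the model decides the task is complete
            break
        execute_action(action)            # tap / type / scroll on the device
        history.append(action)
    return history
```

The loop stops either when the model emits a stop action or when a step budget is exhausted, which is one common way such agents bound their interaction length.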

This paper is available on arxiv under CC BY 4.0 DEED license.