Abstract
Navigating through mobile applications to accomplish specific tasks can be time-consuming and challenging, particularly for users who are unfamiliar with an app or faced with intricate menu structures. Simplifying access to a particular screen is a shared user priority, especially for individuals with diverse needs, including those with specific accessibility requirements, which underscores the need for innovative solutions that streamline the navigation process. This work addresses the challenge of mapping natural language intents to user interfaces, with a specific focus on mobile applications. Its primary objective is to provide users with a more intuitive and efficient way to reach desired screens in mobile applications by expressing their intentions in natural language. Existing approaches to this task have relied heavily on qualitative human studies to evaluate performance. Moreover, widely used pre-trained vision-language models, such as Contrastive Language-Image Pretraining (CLIP), struggle to generalize to the distinctive visual characteristics of user interfaces. Acknowledging these limitations, we introduce a novel approach that harnesses the power of pre-trained vision-language models. Specifically, we investigate whether fine-tuning pre-trained vision-language models on mobile screens can address the challenges posed by the intricate nature of mobile application interfaces. Our approach utilizes state-of-the-art pre-trained text and image encoders and employs a supervised fine-tuning process in which the pre-trained models are adapted to the specific needs of mobile screen interactions. A shared embedding space aligns the embeddings of the text and image modalities, fostering a cohesive understanding between natural language intents and the visual features of user interface elements. We conduct extensive experimental evaluations on the Screen2Words dataset. Through systematic analysis and established metrics, we examine the models' ability to accurately map diverse linguistic intents to specific user interfaces. Our analysis demonstrates that fine-tuning yields substantial improvements over the zero-shot performance of the pre-trained vision-language models.
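The abstract describes the core recipe: pre-trained text and image encoders, supervised fine-tuning on mobile screens, and a shared embedding space that aligns natural language intents with screenshots. The snippet below is a minimal sketch of that style of contrastive fine-tuning, not the authors' implementation; the checkpoint name `openai/clip-vit-base-patch32`, the `intent_screen_pairs` mini-batch, and the screenshot paths are illustrative assumptions.

```python
# Minimal sketch (not the paper's released code): contrastive fine-tuning of a
# pre-trained CLIP model on (natural-language intent, UI screenshot) pairs.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical mini-batch of (intent, screenshot path) pairs drawn from a
# Screen2Words-style dataset; in practice this would come from a DataLoader.
intent_screen_pairs = [
    ("open the settings screen", "screens/settings.png"),
    ("show my shopping cart", "screens/cart.png"),
]

model.train()
texts = [intent for intent, _ in intent_screen_pairs]
images = [Image.open(path).convert("RGB") for _, path in intent_screen_pairs]
inputs = processor(text=texts, images=images, return_tensors="pt",
                   padding=True, truncation=True).to(device)

# CLIP's built-in symmetric contrastive loss pulls matching intent/screen
# embeddings together in the shared embedding space and pushes mismatches apart.
outputs = model(**inputs, return_loss=True)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

# At inference time, text-to-image similarity scores rank candidate screens
# for a new intent (retrieval over the app's screen set).
model.eval()
with torch.no_grad():
    similarity = model(**inputs).logits_per_text
```

After fine-tuning, retrieval amounts to embedding the user's intent once and comparing it against precomputed screen embeddings, so screen ranking stays cheap at run time.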
Original language | English |
---|---|
Title of host publication | Proceedings - 18th IEEE International Conference on Semantic Computing, ICSC 2024 |
Pages | 237-244 |
Number of pages | 8 |
ISBN (Electronic) | 9798350385359 |
DOIs | |
State | Published - 2024 |
Event | 18th IEEE International Conference on Semantic Computing, ICSC 2024 - Hybrid, Laguna Hills, United States. Duration: Feb 5 2024 → Feb 7 2024
Publication series
Name | Proceedings - IEEE International Conference on Semantic Computing, ICSC |
---|---|
ISSN (Print) | 2325-6516 |
ISSN (Electronic) | 2472-9671 |
Conference
Conference | 18th IEEE International Conference on Semantic Computing, ICSC 2024 |
---|---|
Country/Territory | United States |
City | Hybrid, Laguna Hills |
Period | 2/5/24 → 2/7/24 |
Bibliographical note
Publisher Copyright: © 2024 IEEE.
Keywords
- Natural language processing
- pre-trained vision-language models
- user interface navigation
ASJC Scopus subject areas
- Artificial Intelligence
- Computer Networks and Communications
- Human-Computer Interaction
- Information Systems and Management